2023-06-23 17:26:52,792 INFO [train.py:1064] (0/4) Training started
2023-06-23 17:26:52,798 INFO [train.py:1074] (0/4) Device: cuda:0
2023-06-23 17:26:55,729 INFO [lexicon.py:168] (0/4) Loading pre-compiled data/lang_char/Linv.pt
2023-06-23 17:26:56,338 INFO [train.py:1085] (0/4) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.1', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'c51a0b9684442a88ee37f3ce0af686a04b66855b', 'k2-git-date': 'Mon May 1 21:38:03 2023', 'lhotse-version': '1.14.0.dev+git.0f812851.dirty', 'torch-version': '1.10.0+cu102', 'torch-cuda-available': True, 'torch-cuda-version': '10.2', 'python-version': '3.8', 'icefall-git-branch': 'zipformer_wenetspeech', 'icefall-git-sha1': '63e53ba-dirty', 'icefall-git-date': 'Wed Jun 21 18:13:24 2023', 'icefall-path': '/star-kw/kangwei/code/icefall_wenetspeech', 'k2-path': '/ceph-hw/kangwei/code/k2_release/k2/k2/python/k2/__init__.py', 'lhotse-path': '/ceph-hw/kangwei/dev_tools/anaconda3/envs/rnnt2/lib/python3.8/site-packages/lhotse-1.14.0.dev0+git.0f812851.dirty-py3.8.egg/lhotse/__init__.py', 'hostname': 'de-74279-k2-train-6-0423201309-7c68fd68fb-6cszs', 'IP address': '10.177.28.83'}, 'world_size': 4, 'master_port': 12536, 'tensorboard': True, 'num_epochs': 12, 'start_epoch': 6, 'start_batch': 0, 'exp_dir': PosixPath('zipformer/exp_L_small'), 'lang_dir': PosixPath('data/lang_char'), 'base_lr': 0.045, 'lr_batches': 7500, 'lr_epochs': 1.5, 'ref_duration': 600, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 4000, 'keep_last_k': 30, 'average_period': 200, 'use_fp16': True, 'num_encoder_layers': '2,2,2,2,2,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,768,768,768,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,256,256,256,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,192,192,192,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 900, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 8, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'training_subset': 'L', 'blank_id': 0, 'vocab_size': 5537}
2023-06-23 17:26:56,339 INFO [train.py:1087] (0/4) About to create model
2023-06-23 17:26:57,116 INFO [train.py:1091] (0/4) Number of model parameters: 32327030
2023-06-23 17:26:57,685 INFO [checkpoint.py:112] (0/4) Loading checkpoint from zipformer/exp_L_small/epoch-5.pt
2023-06-23 17:26:59,905 INFO [checkpoint.py:131] (0/4) Loading averaged model
2023-06-23 17:27:08,902 INFO [train.py:1106] (0/4) Using DDP
2023-06-23 17:27:09,623 INFO [train.py:1118] (0/4) Loading optimizer state dict
2023-06-23 17:27:10,104 INFO [train.py:1126] (0/4) Loading scheduler state dict
2023-06-23 17:27:10,105 INFO [asr_datamodule.py:390] (0/4) About to get train cuts
2023-06-23 17:27:10,116 INFO [asr_datamodule.py:398] (0/4) About to get dev cuts
2023-06-23 17:27:10,128 INFO [asr_datamodule.py:211] (0/4) About to get Musan cuts 2023-06-23 17:27:13,858 INFO [asr_datamodule.py:216] (0/4) Enable MUSAN 2023-06-23 17:27:13,859 INFO [asr_datamodule.py:239] (0/4) Enable SpecAugment 2023-06-23 17:27:13,859 INFO [asr_datamodule.py:240] (0/4) Time warp factor: 80 2023-06-23 17:27:13,859 INFO [asr_datamodule.py:250] (0/4) Num frame mask: 10 2023-06-23 17:27:13,860 INFO [asr_datamodule.py:263] (0/4) About to create train dataset 2023-06-23 17:27:13,860 INFO [asr_datamodule.py:289] (0/4) Using DynamicBucketingSampler. 2023-06-23 17:27:19,128 INFO [asr_datamodule.py:305] (0/4) About to create train dataloader 2023-06-23 17:27:19,130 INFO [asr_datamodule.py:336] (0/4) About to create dev dataset 2023-06-23 17:27:20,093 INFO [asr_datamodule.py:354] (0/4) About to create dev dataloader 2023-06-23 17:27:20,094 INFO [train.py:1206] (0/4) Loading grad scaler state dict 2023-06-23 17:29:33,345 INFO [train.py:996] (0/4) Epoch 6, batch 0, loss[loss=0.219, simple_loss=0.2752, pruned_loss=0.08142, over 21294.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2752, pruned_loss=0.08142, over 21294.00 frames. ], batch size: 177, lr: 5.35e-03, grad_scale: 32.0 2023-06-23 17:29:33,346 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-23 17:29:50,958 INFO [train.py:1028] (0/4) Epoch 6, validation: loss=0.2383, simple_loss=0.345, pruned_loss=0.06586, over 1796401.00 frames. 2023-06-23 17:29:50,959 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 21828MB 2023-06-23 17:30:05,497 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 17:30:24,646 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.81 vs. limit=22.5 2023-06-23 17:30:28,636 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.741e+02 4.794e+02 6.251e+02 8.348e+02 2.118e+03, threshold=1.250e+03, percent-clipped=42.0 2023-06-23 17:31:11,051 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=915018.0, ans=0.1 2023-06-23 17:31:35,970 INFO [train.py:996] (0/4) Epoch 6, batch 50, loss[loss=0.323, simple_loss=0.3978, pruned_loss=0.1241, over 21489.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3111, pruned_loss=0.07865, over 962993.18 frames. ], batch size: 471, lr: 5.35e-03, grad_scale: 16.0 2023-06-23 17:32:41,230 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.00 vs. limit=15.0 2023-06-23 17:32:58,871 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=915318.0, ans=0.1 2023-06-23 17:33:14,090 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=915378.0, ans=0.1 2023-06-23 17:33:14,109 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=915378.0, ans=0.09899494936611666 2023-06-23 17:33:21,860 INFO [train.py:996] (0/4) Epoch 6, batch 100, loss[loss=0.2489, simple_loss=0.3548, pruned_loss=0.07151, over 21443.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3233, pruned_loss=0.0804, over 1686667.08 frames. 
], batch size: 211, lr: 5.34e-03, grad_scale: 16.0 2023-06-23 17:33:33,671 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=915438.0, ans=0.125 2023-06-23 17:33:43,626 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=915498.0, ans=0.125 2023-06-23 17:33:43,709 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=915498.0, ans=0.125 2023-06-23 17:34:00,740 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.86 vs. limit=15.0 2023-06-23 17:34:04,536 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.886e+02 2.333e+02 2.600e+02 2.995e+02 4.991e+02, threshold=5.199e+02, percent-clipped=0.0 2023-06-23 17:34:06,481 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=915558.0, ans=0.125 2023-06-23 17:35:09,830 INFO [train.py:996] (0/4) Epoch 6, batch 150, loss[loss=0.2504, simple_loss=0.3344, pruned_loss=0.0832, over 21241.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.3305, pruned_loss=0.08159, over 2267920.37 frames. ], batch size: 143, lr: 5.34e-03, grad_scale: 16.0 2023-06-23 17:36:36,149 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=915918.0, ans=0.0 2023-06-23 17:36:40,348 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.00 vs. limit=6.0 2023-06-23 17:36:59,859 INFO [train.py:996] (0/4) Epoch 6, batch 200, loss[loss=0.3153, simple_loss=0.3698, pruned_loss=0.1304, over 21387.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3276, pruned_loss=0.08194, over 2700422.84 frames. ], batch size: 471, lr: 5.34e-03, grad_scale: 16.0 2023-06-23 17:37:13,483 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.63 vs. limit=15.0 2023-06-23 17:37:13,502 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.53 vs. limit=15.0 2023-06-23 17:37:17,986 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=916098.0, ans=0.125 2023-06-23 17:37:40,264 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.885e+02 2.585e+02 2.985e+02 3.639e+02 6.609e+02, threshold=5.970e+02, percent-clipped=4.0 2023-06-23 17:38:27,711 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=916278.0, ans=0.125 2023-06-23 17:38:35,294 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.85 vs. limit=15.0 2023-06-23 17:38:47,089 INFO [train.py:996] (0/4) Epoch 6, batch 250, loss[loss=0.2484, simple_loss=0.3269, pruned_loss=0.08498, over 21594.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3249, pruned_loss=0.08161, over 3051094.56 frames. 
], batch size: 389, lr: 5.34e-03, grad_scale: 16.0 2023-06-23 17:38:47,786 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=916338.0, ans=0.125 2023-06-23 17:39:11,058 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=916398.0, ans=0.0 2023-06-23 17:39:26,397 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=916398.0, ans=0.2 2023-06-23 17:39:27,988 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=916458.0, ans=0.125 2023-06-23 17:40:00,739 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=916518.0, ans=0.1 2023-06-23 17:40:28,683 INFO [train.py:996] (0/4) Epoch 6, batch 300, loss[loss=0.1812, simple_loss=0.2476, pruned_loss=0.05739, over 21369.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3202, pruned_loss=0.08134, over 3317940.67 frames. ], batch size: 131, lr: 5.34e-03, grad_scale: 16.0 2023-06-23 17:40:35,259 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=916638.0, ans=0.0 2023-06-23 17:41:00,660 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=916698.0, ans=0.035 2023-06-23 17:41:08,940 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.075e+02 2.631e+02 3.060e+02 3.627e+02 5.054e+02, threshold=6.120e+02, percent-clipped=0.0 2023-06-23 17:41:45,599 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=916818.0, ans=0.1 2023-06-23 17:42:21,733 INFO [train.py:996] (0/4) Epoch 6, batch 350, loss[loss=0.2198, simple_loss=0.316, pruned_loss=0.06182, over 21740.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3137, pruned_loss=0.07982, over 3536830.02 frames. ], batch size: 124, lr: 5.34e-03, grad_scale: 16.0 2023-06-23 17:43:12,136 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.51 vs. limit=6.0 2023-06-23 17:43:27,861 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=917058.0, ans=0.125 2023-06-23 17:43:33,476 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=917118.0, ans=0.0 2023-06-23 17:43:38,797 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=917118.0, ans=0.125 2023-06-23 17:44:07,648 INFO [train.py:996] (0/4) Epoch 6, batch 400, loss[loss=0.2347, simple_loss=0.3535, pruned_loss=0.05794, over 20803.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3078, pruned_loss=0.07746, over 3698708.31 frames. 
], batch size: 608, lr: 5.34e-03, grad_scale: 32.0 2023-06-23 17:44:17,144 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 17:44:28,482 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=917238.0, ans=0.2 2023-06-23 17:44:47,066 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.014e+02 2.687e+02 2.996e+02 3.462e+02 5.169e+02, threshold=5.992e+02, percent-clipped=0.0 2023-06-23 17:45:14,576 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=917358.0, ans=0.07 2023-06-23 17:45:55,313 INFO [train.py:996] (0/4) Epoch 6, batch 450, loss[loss=0.1686, simple_loss=0.2491, pruned_loss=0.04406, over 21299.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.3027, pruned_loss=0.07631, over 3831426.26 frames. ], batch size: 176, lr: 5.34e-03, grad_scale: 32.0 2023-06-23 17:46:24,078 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.13 vs. limit=15.0 2023-06-23 17:46:27,145 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.96 vs. limit=15.0 2023-06-23 17:46:30,247 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=917598.0, ans=0.125 2023-06-23 17:47:23,037 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=917718.0, ans=0.125 2023-06-23 17:47:38,358 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=917778.0, ans=0.125 2023-06-23 17:47:46,331 INFO [train.py:996] (0/4) Epoch 6, batch 500, loss[loss=0.2064, simple_loss=0.2672, pruned_loss=0.0728, over 21703.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3045, pruned_loss=0.07564, over 3938991.12 frames. ], batch size: 112, lr: 5.34e-03, grad_scale: 16.0 2023-06-23 17:47:47,831 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.05 vs. limit=6.0 2023-06-23 17:48:13,989 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=917898.0, ans=15.0 2023-06-23 17:48:15,238 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=917898.0, ans=0.0 2023-06-23 17:48:32,681 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.054e+02 2.519e+02 2.896e+02 3.744e+02 5.708e+02, threshold=5.793e+02, percent-clipped=0.0 2023-06-23 17:48:58,037 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=918018.0, ans=0.125 2023-06-23 17:49:30,901 INFO [train.py:996] (0/4) Epoch 6, batch 550, loss[loss=0.2179, simple_loss=0.2906, pruned_loss=0.07259, over 19935.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.307, pruned_loss=0.07489, over 4021145.86 frames. 
], batch size: 704, lr: 5.34e-03, grad_scale: 16.0 2023-06-23 17:49:43,837 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=918138.0, ans=0.0 2023-06-23 17:50:35,905 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=918258.0, ans=0.2 2023-06-23 17:51:15,206 INFO [train.py:996] (0/4) Epoch 6, batch 600, loss[loss=0.2307, simple_loss=0.3028, pruned_loss=0.07927, over 21366.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3074, pruned_loss=0.0743, over 4076144.14 frames. ], batch size: 176, lr: 5.34e-03, grad_scale: 16.0 2023-06-23 17:51:15,847 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=918438.0, ans=0.0 2023-06-23 17:51:25,321 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.12 vs. limit=6.0 2023-06-23 17:51:47,084 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.98 vs. limit=15.0 2023-06-23 17:52:12,356 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.806e+02 2.707e+02 3.073e+02 3.854e+02 5.945e+02, threshold=6.147e+02, percent-clipped=1.0 2023-06-23 17:52:13,428 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=918558.0, ans=0.2 2023-06-23 17:53:04,187 INFO [train.py:996] (0/4) Epoch 6, batch 650, loss[loss=0.2435, simple_loss=0.3085, pruned_loss=0.08926, over 21909.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3078, pruned_loss=0.07443, over 4128943.77 frames. ], batch size: 414, lr: 5.34e-03, grad_scale: 16.0 2023-06-23 17:53:05,421 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.96 vs. limit=15.0 2023-06-23 17:53:43,380 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=918798.0, ans=0.125 2023-06-23 17:54:06,622 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.65 vs. limit=22.5 2023-06-23 17:54:37,019 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=918978.0, ans=0.1 2023-06-23 17:54:45,852 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=919038.0, ans=0.025 2023-06-23 17:54:46,844 INFO [train.py:996] (0/4) Epoch 6, batch 700, loss[loss=0.2419, simple_loss=0.375, pruned_loss=0.0544, over 19743.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3094, pruned_loss=0.07586, over 4160546.27 frames. ], batch size: 703, lr: 5.33e-03, grad_scale: 16.0 2023-06-23 17:55:38,550 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.091e+02 2.507e+02 2.938e+02 3.548e+02 4.696e+02, threshold=5.875e+02, percent-clipped=0.0 2023-06-23 17:56:35,827 INFO [train.py:996] (0/4) Epoch 6, batch 750, loss[loss=0.2445, simple_loss=0.3254, pruned_loss=0.08185, over 21651.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3087, pruned_loss=0.077, over 4188235.38 frames. 
], batch size: 230, lr: 5.33e-03, grad_scale: 16.0 2023-06-23 17:57:05,474 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=919398.0, ans=0.95 2023-06-23 17:57:57,746 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=919518.0, ans=0.1 2023-06-23 17:58:10,175 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=919578.0, ans=0.1 2023-06-23 17:58:23,880 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=919638.0, ans=0.0 2023-06-23 17:58:24,801 INFO [train.py:996] (0/4) Epoch 6, batch 800, loss[loss=0.214, simple_loss=0.2876, pruned_loss=0.07023, over 21707.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3052, pruned_loss=0.07709, over 4203972.19 frames. ], batch size: 298, lr: 5.33e-03, grad_scale: 32.0 2023-06-23 17:58:39,084 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=919638.0, ans=0.2 2023-06-23 17:58:44,235 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=919698.0, ans=0.0 2023-06-23 17:59:00,312 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=919698.0, ans=0.1 2023-06-23 17:59:03,456 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=919758.0, ans=0.125 2023-06-23 17:59:04,541 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.034e+02 2.561e+02 2.955e+02 3.550e+02 6.098e+02, threshold=5.911e+02, percent-clipped=2.0 2023-06-23 17:59:17,688 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=919758.0, ans=0.125 2023-06-23 17:59:58,656 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=919878.0, ans=0.0 2023-06-23 18:00:08,610 INFO [train.py:996] (0/4) Epoch 6, batch 850, loss[loss=0.2178, simple_loss=0.3465, pruned_loss=0.0446, over 20798.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3033, pruned_loss=0.07643, over 4217470.01 frames. ], batch size: 608, lr: 5.33e-03, grad_scale: 32.0 2023-06-23 18:00:14,762 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=919938.0, ans=0.125 2023-06-23 18:00:21,694 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=919938.0, ans=0.0 2023-06-23 18:00:40,356 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=919998.0, ans=0.125 2023-06-23 18:01:24,900 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=920118.0, ans=0.125 2023-06-23 18:01:59,445 INFO [train.py:996] (0/4) Epoch 6, batch 900, loss[loss=0.2413, simple_loss=0.3161, pruned_loss=0.08325, over 21851.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3011, pruned_loss=0.07587, over 4232570.79 frames. 
], batch size: 371, lr: 5.33e-03, grad_scale: 16.0 2023-06-23 18:02:27,491 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=920298.0, ans=0.2 2023-06-23 18:02:29,316 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=920298.0, ans=0.0 2023-06-23 18:02:53,515 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.973e+02 2.549e+02 3.030e+02 3.332e+02 5.799e+02, threshold=6.061e+02, percent-clipped=0.0 2023-06-23 18:03:15,718 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=920418.0, ans=0.125 2023-06-23 18:03:32,185 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.73 vs. limit=15.0 2023-06-23 18:03:33,255 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=920478.0, ans=0.0 2023-06-23 18:03:50,042 INFO [train.py:996] (0/4) Epoch 6, batch 950, loss[loss=0.1717, simple_loss=0.2501, pruned_loss=0.04665, over 21290.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.2985, pruned_loss=0.07567, over 4249526.52 frames. ], batch size: 176, lr: 5.33e-03, grad_scale: 16.0 2023-06-23 18:05:21,018 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=920718.0, ans=0.125 2023-06-23 18:05:38,371 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=920778.0, ans=0.125 2023-06-23 18:05:41,240 INFO [train.py:996] (0/4) Epoch 6, batch 1000, loss[loss=0.1972, simple_loss=0.2547, pruned_loss=0.06986, over 21193.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.2977, pruned_loss=0.07565, over 4260156.68 frames. ], batch size: 548, lr: 5.33e-03, grad_scale: 16.0 2023-06-23 18:05:57,164 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=920838.0, ans=0.125 2023-06-23 18:06:42,106 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.855e+02 2.583e+02 2.913e+02 3.407e+02 5.854e+02, threshold=5.827e+02, percent-clipped=0.0 2023-06-23 18:06:55,884 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.72 vs. limit=22.5 2023-06-23 18:07:26,345 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=921078.0, ans=0.2 2023-06-23 18:07:32,545 INFO [train.py:996] (0/4) Epoch 6, batch 1050, loss[loss=0.2379, simple_loss=0.3287, pruned_loss=0.07358, over 21793.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.2988, pruned_loss=0.07578, over 4271689.27 frames. 
], batch size: 371, lr: 5.33e-03, grad_scale: 16.0 2023-06-23 18:08:01,568 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 18:08:20,718 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 18:08:49,712 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=921318.0, ans=0.125 2023-06-23 18:08:57,127 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=921318.0, ans=0.0 2023-06-23 18:08:58,879 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=921318.0, ans=0.0 2023-06-23 18:09:31,265 INFO [train.py:996] (0/4) Epoch 6, batch 1100, loss[loss=0.1882, simple_loss=0.2755, pruned_loss=0.0504, over 21639.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.2981, pruned_loss=0.07455, over 4270801.60 frames. ], batch size: 263, lr: 5.33e-03, grad_scale: 16.0 2023-06-23 18:09:37,414 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=921438.0, ans=0.1 2023-06-23 18:10:25,290 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.974e+02 2.670e+02 3.079e+02 4.028e+02 7.418e+02, threshold=6.158e+02, percent-clipped=6.0 2023-06-23 18:10:57,234 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 18:11:14,835 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.98 vs. limit=15.0 2023-06-23 18:11:29,969 INFO [train.py:996] (0/4) Epoch 6, batch 1150, loss[loss=0.2028, simple_loss=0.2685, pruned_loss=0.06859, over 16664.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.2997, pruned_loss=0.07465, over 4275063.01 frames. ], batch size: 60, lr: 5.33e-03, grad_scale: 16.0 2023-06-23 18:12:16,402 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=921858.0, ans=0.0 2023-06-23 18:13:07,325 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=921978.0, ans=0.125 2023-06-23 18:13:17,181 INFO [train.py:996] (0/4) Epoch 6, batch 1200, loss[loss=0.2068, simple_loss=0.2655, pruned_loss=0.07405, over 21188.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.301, pruned_loss=0.0757, over 4278730.80 frames. 
], batch size: 608, lr: 5.33e-03, grad_scale: 32.0 2023-06-23 18:13:19,671 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=922038.0, ans=0.125 2023-06-23 18:13:59,212 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.082e+02 2.616e+02 3.018e+02 3.638e+02 5.698e+02, threshold=6.035e+02, percent-clipped=0.0 2023-06-23 18:14:28,116 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=922218.0, ans=0.1 2023-06-23 18:14:53,624 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=922278.0, ans=0.125 2023-06-23 18:15:07,509 INFO [train.py:996] (0/4) Epoch 6, batch 1250, loss[loss=0.2331, simple_loss=0.3019, pruned_loss=0.08214, over 21841.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3043, pruned_loss=0.07763, over 4276588.38 frames. ], batch size: 107, lr: 5.32e-03, grad_scale: 32.0 2023-06-23 18:15:26,275 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=922338.0, ans=0.125 2023-06-23 18:15:52,762 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.37 vs. limit=12.0 2023-06-23 18:16:58,757 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=922638.0, ans=0.0 2023-06-23 18:16:59,816 INFO [train.py:996] (0/4) Epoch 6, batch 1300, loss[loss=0.2018, simple_loss=0.273, pruned_loss=0.06531, over 21435.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.306, pruned_loss=0.07812, over 4277904.17 frames. ], batch size: 131, lr: 5.32e-03, grad_scale: 32.0 2023-06-23 18:17:14,818 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=922638.0, ans=0.125 2023-06-23 18:17:29,503 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=922698.0, ans=0.0 2023-06-23 18:17:42,853 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.046e+02 2.762e+02 3.245e+02 4.001e+02 7.520e+02, threshold=6.490e+02, percent-clipped=2.0 2023-06-23 18:17:54,851 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=922758.0, ans=0.0 2023-06-23 18:18:29,537 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=922878.0, ans=10.0 2023-06-23 18:18:31,027 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=922878.0, ans=0.2 2023-06-23 18:18:46,510 INFO [train.py:996] (0/4) Epoch 6, batch 1350, loss[loss=0.2161, simple_loss=0.3025, pruned_loss=0.0649, over 21636.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3068, pruned_loss=0.07783, over 4274083.09 frames. 
], batch size: 230, lr: 5.32e-03, grad_scale: 32.0 2023-06-23 18:19:12,009 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=922998.0, ans=0.0 2023-06-23 18:19:44,591 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=923058.0, ans=0.125 2023-06-23 18:19:58,684 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=923118.0, ans=0.05 2023-06-23 18:20:36,684 INFO [train.py:996] (0/4) Epoch 6, batch 1400, loss[loss=0.2146, simple_loss=0.2841, pruned_loss=0.07253, over 21800.00 frames. ], tot_loss[loss=0.231, simple_loss=0.305, pruned_loss=0.07845, over 4279696.86 frames. ], batch size: 98, lr: 5.32e-03, grad_scale: 32.0 2023-06-23 18:21:10,781 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=923298.0, ans=0.1 2023-06-23 18:21:12,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=923298.0, ans=0.0 2023-06-23 18:21:21,168 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.008e+02 2.458e+02 2.680e+02 3.185e+02 5.161e+02, threshold=5.361e+02, percent-clipped=0.0 2023-06-23 18:21:42,371 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=923418.0, ans=0.125 2023-06-23 18:22:35,798 INFO [train.py:996] (0/4) Epoch 6, batch 1450, loss[loss=0.2173, simple_loss=0.2766, pruned_loss=0.07904, over 21650.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3058, pruned_loss=0.07936, over 4278258.51 frames. ], batch size: 415, lr: 5.32e-03, grad_scale: 16.0 2023-06-23 18:23:04,694 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=923598.0, ans=0.125 2023-06-23 18:23:15,422 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=923658.0, ans=0.0 2023-06-23 18:23:40,333 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.96 vs. limit=15.0 2023-06-23 18:23:52,528 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=923718.0, ans=0.0 2023-06-23 18:24:26,510 INFO [train.py:996] (0/4) Epoch 6, batch 1500, loss[loss=0.2355, simple_loss=0.303, pruned_loss=0.08401, over 21336.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3081, pruned_loss=0.08011, over 4279661.87 frames. ], batch size: 159, lr: 5.32e-03, grad_scale: 16.0 2023-06-23 18:24:30,801 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=923838.0, ans=0.0 2023-06-23 18:25:06,058 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.940e+02 2.614e+02 2.900e+02 3.425e+02 5.180e+02, threshold=5.801e+02, percent-clipped=0.0 2023-06-23 18:25:22,058 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=923958.0, ans=0.0 2023-06-23 18:26:08,264 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=924078.0, ans=0.2 2023-06-23 18:26:20,377 INFO [train.py:996] (0/4) Epoch 6, batch 1550, loss[loss=0.2124, simple_loss=0.2876, pruned_loss=0.0686, over 21494.00 frames. 
], tot_loss[loss=0.2319, simple_loss=0.3063, pruned_loss=0.07879, over 4279548.44 frames. ], batch size: 131, lr: 5.32e-03, grad_scale: 16.0 2023-06-23 18:28:14,004 INFO [train.py:996] (0/4) Epoch 6, batch 1600, loss[loss=0.2248, simple_loss=0.2955, pruned_loss=0.07706, over 21801.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3051, pruned_loss=0.07904, over 4275914.44 frames. ], batch size: 316, lr: 5.32e-03, grad_scale: 32.0 2023-06-23 18:28:18,656 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=924438.0, ans=0.1 2023-06-23 18:28:31,600 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=924498.0, ans=0.0 2023-06-23 18:29:08,562 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 2.611e+02 2.907e+02 3.387e+02 5.572e+02, threshold=5.813e+02, percent-clipped=0.0 2023-06-23 18:29:29,898 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=924618.0, ans=0.125 2023-06-23 18:29:40,157 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=924618.0, ans=0.0 2023-06-23 18:29:52,814 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=924678.0, ans=0.05 2023-06-23 18:29:58,418 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=924678.0, ans=0.125 2023-06-23 18:30:08,433 INFO [train.py:996] (0/4) Epoch 6, batch 1650, loss[loss=0.2389, simple_loss=0.3049, pruned_loss=0.08643, over 21746.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3041, pruned_loss=0.07889, over 4274710.27 frames. ], batch size: 389, lr: 5.32e-03, grad_scale: 16.0 2023-06-23 18:30:30,953 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.78 vs. limit=12.0 2023-06-23 18:30:51,659 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=924798.0, ans=0.125 2023-06-23 18:31:20,882 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=924918.0, ans=0.125 2023-06-23 18:32:02,892 INFO [train.py:996] (0/4) Epoch 6, batch 1700, loss[loss=0.2451, simple_loss=0.3197, pruned_loss=0.08523, over 21694.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3073, pruned_loss=0.07969, over 4278557.51 frames. ], batch size: 351, lr: 5.32e-03, grad_scale: 16.0 2023-06-23 18:32:16,037 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.48 vs. limit=15.0 2023-06-23 18:33:01,023 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.000e+02 2.590e+02 2.907e+02 3.447e+02 5.734e+02, threshold=5.814e+02, percent-clipped=0.0 2023-06-23 18:33:24,028 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=925218.0, ans=0.125 2023-06-23 18:33:38,798 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.06 vs. 
limit=15.0 2023-06-23 18:33:59,315 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=925278.0, ans=0.0 2023-06-23 18:34:02,132 INFO [train.py:996] (0/4) Epoch 6, batch 1750, loss[loss=0.1854, simple_loss=0.2759, pruned_loss=0.0475, over 21710.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3062, pruned_loss=0.07749, over 4275355.88 frames. ], batch size: 332, lr: 5.32e-03, grad_scale: 16.0 2023-06-23 18:34:28,283 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=925398.0, ans=10.0 2023-06-23 18:34:43,166 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=925398.0, ans=0.0 2023-06-23 18:34:50,553 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=925458.0, ans=0.0 2023-06-23 18:35:02,108 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=925458.0, ans=0.125 2023-06-23 18:35:09,883 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.95 vs. limit=15.0 2023-06-23 18:35:25,602 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=925518.0, ans=0.125 2023-06-23 18:35:37,753 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=925578.0, ans=0.1 2023-06-23 18:36:02,449 INFO [train.py:996] (0/4) Epoch 6, batch 1800, loss[loss=0.2131, simple_loss=0.298, pruned_loss=0.06415, over 21290.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3049, pruned_loss=0.075, over 4274737.51 frames. ], batch size: 176, lr: 5.32e-03, grad_scale: 16.0 2023-06-23 18:36:56,006 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.930e+02 2.395e+02 2.914e+02 3.634e+02 6.423e+02, threshold=5.828e+02, percent-clipped=1.0 2023-06-23 18:37:07,293 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=925818.0, ans=0.125 2023-06-23 18:37:29,494 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=925878.0, ans=0.125 2023-06-23 18:37:39,530 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=925878.0, ans=0.125 2023-06-23 18:37:53,459 INFO [train.py:996] (0/4) Epoch 6, batch 1850, loss[loss=0.2423, simple_loss=0.3152, pruned_loss=0.08469, over 21512.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.305, pruned_loss=0.07295, over 4274942.94 frames. ], batch size: 441, lr: 5.31e-03, grad_scale: 8.0 2023-06-23 18:39:18,211 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=926118.0, ans=0.0 2023-06-23 18:39:43,537 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 18:39:46,160 INFO [train.py:996] (0/4) Epoch 6, batch 1900, loss[loss=0.2059, simple_loss=0.2813, pruned_loss=0.06528, over 21763.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.306, pruned_loss=0.07453, over 4276696.34 frames. 
], batch size: 112, lr: 5.31e-03, grad_scale: 8.0 2023-06-23 18:40:39,860 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.885e+02 2.383e+02 2.644e+02 3.253e+02 4.924e+02, threshold=5.288e+02, percent-clipped=0.0 2023-06-23 18:40:44,368 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=926358.0, ans=0.1 2023-06-23 18:41:06,371 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=926418.0, ans=0.2 2023-06-23 18:41:18,428 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=926478.0, ans=0.0 2023-06-23 18:41:23,901 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=926478.0, ans=0.125 2023-06-23 18:41:37,833 INFO [train.py:996] (0/4) Epoch 6, batch 1950, loss[loss=0.2071, simple_loss=0.2649, pruned_loss=0.07465, over 21582.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.3024, pruned_loss=0.0735, over 4253215.13 frames. ], batch size: 415, lr: 5.31e-03, grad_scale: 8.0 2023-06-23 18:41:56,443 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff2.min_abs, batch_count=926538.0, ans=0.1 2023-06-23 18:42:25,952 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=926658.0, ans=0.125 2023-06-23 18:43:11,384 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=926778.0, ans=0.2 2023-06-23 18:43:32,498 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=926778.0, ans=0.125 2023-06-23 18:43:37,087 INFO [train.py:996] (0/4) Epoch 6, batch 2000, loss[loss=0.1937, simple_loss=0.2643, pruned_loss=0.0616, over 21644.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2966, pruned_loss=0.07137, over 4264166.08 frames. ], batch size: 247, lr: 5.31e-03, grad_scale: 16.0 2023-06-23 18:44:11,480 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=926898.0, ans=0.125 2023-06-23 18:44:24,632 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.849e+02 2.599e+02 2.979e+02 3.641e+02 7.240e+02, threshold=5.958e+02, percent-clipped=3.0 2023-06-23 18:44:44,367 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=927018.0, ans=0.0 2023-06-23 18:45:28,379 INFO [train.py:996] (0/4) Epoch 6, batch 2050, loss[loss=0.1919, simple_loss=0.2653, pruned_loss=0.0592, over 21630.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2983, pruned_loss=0.07144, over 4267307.70 frames. 
], batch size: 298, lr: 5.31e-03, grad_scale: 16.0 2023-06-23 18:45:34,483 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=927138.0, ans=0.0 2023-06-23 18:45:34,598 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=927138.0, ans=0.0 2023-06-23 18:45:59,696 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=927198.0, ans=0.1 2023-06-23 18:47:10,358 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=927378.0, ans=0.125 2023-06-23 18:47:20,411 INFO [train.py:996] (0/4) Epoch 6, batch 2100, loss[loss=0.2338, simple_loss=0.3395, pruned_loss=0.06405, over 21171.00 frames. ], tot_loss[loss=0.223, simple_loss=0.2993, pruned_loss=0.0734, over 4272964.19 frames. ], batch size: 548, lr: 5.31e-03, grad_scale: 16.0 2023-06-23 18:47:38,880 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=927438.0, ans=0.1 2023-06-23 18:47:47,839 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=927498.0, ans=0.125 2023-06-23 18:48:08,466 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.892e+02 2.503e+02 2.741e+02 3.125e+02 4.918e+02, threshold=5.483e+02, percent-clipped=0.0 2023-06-23 18:48:45,599 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=927678.0, ans=0.0 2023-06-23 18:48:49,386 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=927678.0, ans=0.125 2023-06-23 18:48:58,140 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=927678.0, ans=0.0 2023-06-23 18:49:07,376 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=927678.0, ans=0.125 2023-06-23 18:49:12,078 INFO [train.py:996] (0/4) Epoch 6, batch 2150, loss[loss=0.24, simple_loss=0.2982, pruned_loss=0.09091, over 21598.00 frames. ], tot_loss[loss=0.226, simple_loss=0.3013, pruned_loss=0.07542, over 4279208.06 frames. ], batch size: 441, lr: 5.31e-03, grad_scale: 16.0 2023-06-23 18:49:16,244 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=927738.0, ans=0.1 2023-06-23 18:49:48,585 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.07 vs. limit=8.0 2023-06-23 18:50:01,089 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.93 vs. limit=12.0 2023-06-23 18:50:57,305 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=927978.0, ans=0.125 2023-06-23 18:50:59,096 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=928038.0, ans=0.125 2023-06-23 18:51:00,003 INFO [train.py:996] (0/4) Epoch 6, batch 2200, loss[loss=0.2578, simple_loss=0.3375, pruned_loss=0.08906, over 21459.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3045, pruned_loss=0.07681, over 4279287.19 frames. 
], batch size: 211, lr: 5.31e-03, grad_scale: 16.0 2023-06-23 18:51:01,070 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.84 vs. limit=22.5 2023-06-23 18:51:47,930 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.154e+02 2.632e+02 2.959e+02 3.421e+02 5.687e+02, threshold=5.917e+02, percent-clipped=1.0 2023-06-23 18:52:08,529 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.35 vs. limit=15.0 2023-06-23 18:52:25,844 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.36 vs. limit=12.0 2023-06-23 18:52:49,697 INFO [train.py:996] (0/4) Epoch 6, batch 2250, loss[loss=0.177, simple_loss=0.2437, pruned_loss=0.05512, over 21403.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3029, pruned_loss=0.07533, over 4281440.19 frames. ], batch size: 131, lr: 5.31e-03, grad_scale: 16.0 2023-06-23 18:53:01,294 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=928338.0, ans=0.125 2023-06-23 18:53:16,833 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=928398.0, ans=0.04949747468305833 2023-06-23 18:54:03,880 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.21 vs. limit=10.0 2023-06-23 18:54:40,524 INFO [train.py:996] (0/4) Epoch 6, batch 2300, loss[loss=0.217, simple_loss=0.2802, pruned_loss=0.07685, over 21818.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.2986, pruned_loss=0.07439, over 4278731.14 frames. ], batch size: 352, lr: 5.31e-03, grad_scale: 16.0 2023-06-23 18:54:43,070 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=928638.0, ans=0.0 2023-06-23 18:55:02,744 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 18:55:26,280 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.90 vs. limit=10.0 2023-06-23 18:55:28,463 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.991e+02 2.420e+02 2.816e+02 3.301e+02 5.962e+02, threshold=5.633e+02, percent-clipped=1.0 2023-06-23 18:55:46,464 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=928818.0, ans=0.125 2023-06-23 18:55:52,016 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=928818.0, ans=0.125 2023-06-23 18:56:00,837 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=928818.0, ans=0.07 2023-06-23 18:56:04,670 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 18:56:38,333 INFO [train.py:996] (0/4) Epoch 6, batch 2350, loss[loss=0.2484, simple_loss=0.3054, pruned_loss=0.0957, over 21254.00 frames. ], tot_loss[loss=0.222, simple_loss=0.295, pruned_loss=0.07449, over 4280845.57 frames. 
], batch size: 159, lr: 5.31e-03, grad_scale: 16.0 2023-06-23 18:56:38,978 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=928938.0, ans=0.125 2023-06-23 18:57:43,700 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=929118.0, ans=0.125 2023-06-23 18:58:30,905 INFO [train.py:996] (0/4) Epoch 6, batch 2400, loss[loss=0.2571, simple_loss=0.3311, pruned_loss=0.09155, over 21718.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3004, pruned_loss=0.07676, over 4278924.52 frames. ], batch size: 332, lr: 5.31e-03, grad_scale: 32.0 2023-06-23 18:59:01,535 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=929298.0, ans=0.125 2023-06-23 18:59:21,202 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.092e+02 2.599e+02 2.851e+02 3.513e+02 5.978e+02, threshold=5.701e+02, percent-clipped=2.0 2023-06-23 19:00:22,707 INFO [train.py:996] (0/4) Epoch 6, batch 2450, loss[loss=0.2216, simple_loss=0.2908, pruned_loss=0.0762, over 15213.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3045, pruned_loss=0.07967, over 4269949.82 frames. ], batch size: 60, lr: 5.30e-03, grad_scale: 16.0 2023-06-23 19:00:37,148 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=929538.0, ans=0.0 2023-06-23 19:00:55,272 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.11 vs. limit=6.0 2023-06-23 19:00:56,351 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 19:01:11,253 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=929658.0, ans=0.125 2023-06-23 19:01:50,622 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=929778.0, ans=0.0 2023-06-23 19:01:51,178 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.64 vs. limit=15.0 2023-06-23 19:02:13,111 INFO [train.py:996] (0/4) Epoch 6, batch 2500, loss[loss=0.2594, simple_loss=0.2965, pruned_loss=0.1111, over 21366.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3041, pruned_loss=0.07915, over 4270375.93 frames. ], batch size: 508, lr: 5.30e-03, grad_scale: 16.0 2023-06-23 19:02:44,394 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.39 vs. limit=22.5 2023-06-23 19:03:03,175 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.913e+02 2.544e+02 2.837e+02 3.478e+02 5.146e+02, threshold=5.674e+02, percent-clipped=0.0 2023-06-23 19:03:03,820 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=929958.0, ans=0.1 2023-06-23 19:03:06,573 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.95 vs. 
limit=22.5 2023-06-23 19:03:45,956 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=930078.0, ans=0.125 2023-06-23 19:04:04,680 INFO [train.py:996] (0/4) Epoch 6, batch 2550, loss[loss=0.2262, simple_loss=0.2963, pruned_loss=0.07802, over 15108.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3029, pruned_loss=0.07765, over 4262186.32 frames. ], batch size: 60, lr: 5.30e-03, grad_scale: 16.0 2023-06-23 19:04:19,746 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=930138.0, ans=0.125 2023-06-23 19:04:23,042 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=930198.0, ans=0.0 2023-06-23 19:05:57,808 INFO [train.py:996] (0/4) Epoch 6, batch 2600, loss[loss=0.1978, simple_loss=0.2699, pruned_loss=0.06287, over 21587.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3034, pruned_loss=0.07759, over 4256822.09 frames. ], batch size: 263, lr: 5.30e-03, grad_scale: 16.0 2023-06-23 19:06:24,298 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=930498.0, ans=0.125 2023-06-23 19:06:47,829 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.153e+02 2.627e+02 2.988e+02 3.634e+02 5.525e+02, threshold=5.976e+02, percent-clipped=0.0 2023-06-23 19:07:05,571 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=930618.0, ans=0.0 2023-06-23 19:07:14,754 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=930618.0, ans=0.035 2023-06-23 19:07:16,425 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=930618.0, ans=0.1 2023-06-23 19:07:24,634 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=930678.0, ans=0.125 2023-06-23 19:07:31,492 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=930678.0, ans=0.125 2023-06-23 19:07:49,110 INFO [train.py:996] (0/4) Epoch 6, batch 2650, loss[loss=0.2249, simple_loss=0.2947, pruned_loss=0.07756, over 21392.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3052, pruned_loss=0.07823, over 4267935.05 frames. ], batch size: 143, lr: 5.30e-03, grad_scale: 16.0 2023-06-23 19:08:10,315 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=930798.0, ans=0.1 2023-06-23 19:08:19,594 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.28 vs. limit=12.0 2023-06-23 19:08:37,477 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=930858.0, ans=0.5 2023-06-23 19:08:50,483 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=10.93 vs. limit=15.0 2023-06-23 19:09:42,217 INFO [train.py:996] (0/4) Epoch 6, batch 2700, loss[loss=0.2242, simple_loss=0.3081, pruned_loss=0.07009, over 21621.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3039, pruned_loss=0.07816, over 4269862.98 frames. 
], batch size: 389, lr: 5.30e-03, grad_scale: 16.0 2023-06-23 19:10:12,852 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 19:10:32,949 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.187e+02 2.709e+02 3.074e+02 3.590e+02 5.374e+02, threshold=6.148e+02, percent-clipped=0.0 2023-06-23 19:10:35,751 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=931158.0, ans=0.125 2023-06-23 19:11:24,674 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.61 vs. limit=15.0 2023-06-23 19:11:26,191 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=931278.0, ans=0.125 2023-06-23 19:11:34,510 INFO [train.py:996] (0/4) Epoch 6, batch 2750, loss[loss=0.255, simple_loss=0.3277, pruned_loss=0.09119, over 21741.00 frames. ], tot_loss[loss=0.23, simple_loss=0.303, pruned_loss=0.07851, over 4274731.89 frames. ], batch size: 112, lr: 5.30e-03, grad_scale: 16.0 2023-06-23 19:11:40,654 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=931338.0, ans=0.0 2023-06-23 19:11:44,165 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=931338.0, ans=0.125 2023-06-23 19:12:05,542 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.85 vs. limit=15.0 2023-06-23 19:12:20,448 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.15 vs. limit=15.0 2023-06-23 19:12:38,780 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=931518.0, ans=0.1 2023-06-23 19:13:24,237 INFO [train.py:996] (0/4) Epoch 6, batch 2800, loss[loss=0.2356, simple_loss=0.3322, pruned_loss=0.06954, over 21404.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3081, pruned_loss=0.0794, over 4272173.38 frames. ], batch size: 211, lr: 5.30e-03, grad_scale: 32.0 2023-06-23 19:13:30,529 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=931638.0, ans=0.0 2023-06-23 19:14:22,174 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.082e+02 2.716e+02 3.036e+02 3.413e+02 5.034e+02, threshold=6.071e+02, percent-clipped=0.0 2023-06-23 19:15:18,180 INFO [train.py:996] (0/4) Epoch 6, batch 2850, loss[loss=0.2285, simple_loss=0.3007, pruned_loss=0.07812, over 21728.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3096, pruned_loss=0.08127, over 4272960.23 frames. 
], batch size: 298, lr: 5.30e-03, grad_scale: 32.0 2023-06-23 19:15:20,440 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=931938.0, ans=0.0 2023-06-23 19:15:22,388 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=931938.0, ans=0.125 2023-06-23 19:15:34,990 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=931938.0, ans=0.0 2023-06-23 19:15:45,731 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=931998.0, ans=0.125 2023-06-23 19:16:50,882 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=932178.0, ans=0.1 2023-06-23 19:16:59,344 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=932178.0, ans=0.125 2023-06-23 19:17:01,110 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=932178.0, ans=0.0 2023-06-23 19:17:07,589 INFO [train.py:996] (0/4) Epoch 6, batch 2900, loss[loss=0.2527, simple_loss=0.3083, pruned_loss=0.09854, over 21733.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3066, pruned_loss=0.08059, over 4281723.83 frames. ], batch size: 473, lr: 5.30e-03, grad_scale: 32.0 2023-06-23 19:17:49,914 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=932358.0, ans=0.1 2023-06-23 19:17:55,130 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 19:17:57,202 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=932358.0, ans=0.035 2023-06-23 19:18:03,657 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.041e+02 2.630e+02 3.132e+02 3.824e+02 7.694e+02, threshold=6.265e+02, percent-clipped=2.0 2023-06-23 19:18:21,900 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=932418.0, ans=0.0 2023-06-23 19:18:51,830 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=932478.0, ans=0.1 2023-06-23 19:18:58,086 INFO [train.py:996] (0/4) Epoch 6, batch 2950, loss[loss=0.2119, simple_loss=0.2889, pruned_loss=0.06749, over 21868.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3076, pruned_loss=0.08089, over 4290942.09 frames. 
], batch size: 118, lr: 5.30e-03, grad_scale: 32.0 2023-06-23 19:19:05,975 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 19:20:00,898 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=932658.0, ans=0.05 2023-06-23 19:20:02,637 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff3.min_abs, batch_count=932658.0, ans=0.2 2023-06-23 19:20:02,667 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff3.min_abs, batch_count=932658.0, ans=0.2 2023-06-23 19:20:36,889 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=932778.0, ans=0.0 2023-06-23 19:20:50,790 INFO [train.py:996] (0/4) Epoch 6, batch 3000, loss[loss=0.2807, simple_loss=0.3488, pruned_loss=0.1063, over 21787.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3114, pruned_loss=0.08084, over 4295370.74 frames. ], batch size: 441, lr: 5.29e-03, grad_scale: 32.0 2023-06-23 19:20:50,792 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-23 19:21:12,305 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.3.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([3.2249, 2.3771, 2.6210, 3.1496, 1.6980, 2.9401, 2.8982, 2.0219], device='cuda:0') 2023-06-23 19:21:13,116 INFO [train.py:1028] (0/4) Epoch 6, validation: loss=0.2526, simple_loss=0.3435, pruned_loss=0.08085, over 1796401.00 frames. 2023-06-23 19:21:13,117 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23616MB 2023-06-23 19:21:33,835 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=932898.0, ans=0.1 2023-06-23 19:21:53,777 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=932898.0, ans=0.125 2023-06-23 19:22:02,588 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=932898.0, ans=0.2 2023-06-23 19:22:12,488 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.12 vs. limit=22.5 2023-06-23 19:22:13,752 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=932958.0, ans=0.125 2023-06-23 19:22:14,691 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.002e+02 2.535e+02 2.851e+02 3.436e+02 5.853e+02, threshold=5.702e+02, percent-clipped=0.0 2023-06-23 19:22:19,640 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.60 vs. limit=6.0 2023-06-23 19:22:35,486 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.34 vs. limit=15.0 2023-06-23 19:23:05,122 INFO [train.py:996] (0/4) Epoch 6, batch 3050, loss[loss=0.2233, simple_loss=0.303, pruned_loss=0.07185, over 21725.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.311, pruned_loss=0.07875, over 4290141.45 frames. ], batch size: 414, lr: 5.29e-03, grad_scale: 32.0 2023-06-23 19:24:53,913 INFO [train.py:996] (0/4) Epoch 6, batch 3100, loss[loss=0.2249, simple_loss=0.3199, pruned_loss=0.06494, over 21588.00 frames. 
], tot_loss[loss=0.2338, simple_loss=0.3108, pruned_loss=0.07837, over 4296587.46 frames. ], batch size: 389, lr: 5.29e-03, grad_scale: 32.0 2023-06-23 19:25:55,818 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.838e+02 2.716e+02 3.164e+02 3.740e+02 6.470e+02, threshold=6.328e+02, percent-clipped=4.0 2023-06-23 19:25:57,144 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.78 vs. limit=15.0 2023-06-23 19:26:02,326 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=933558.0, ans=0.125 2023-06-23 19:26:05,983 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=933618.0, ans=0.1 2023-06-23 19:26:29,873 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=933678.0, ans=0.0 2023-06-23 19:26:52,700 INFO [train.py:996] (0/4) Epoch 6, batch 3150, loss[loss=0.2652, simple_loss=0.3379, pruned_loss=0.09625, over 21237.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.312, pruned_loss=0.0785, over 4291445.66 frames. ], batch size: 176, lr: 5.29e-03, grad_scale: 32.0 2023-06-23 19:27:22,787 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=933798.0, ans=0.0 2023-06-23 19:27:45,851 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.whiten.whitening_limit, batch_count=933858.0, ans=15.0 2023-06-23 19:28:14,358 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=933918.0, ans=0.0 2023-06-23 19:28:17,863 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=933918.0, ans=0.125 2023-06-23 19:28:56,535 INFO [train.py:996] (0/4) Epoch 6, batch 3200, loss[loss=0.2301, simple_loss=0.3106, pruned_loss=0.0748, over 21712.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3131, pruned_loss=0.07883, over 4286846.65 frames. ], batch size: 298, lr: 5.29e-03, grad_scale: 32.0 2023-06-23 19:29:20,163 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=934098.0, ans=0.125 2023-06-23 19:29:46,300 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.879e+02 2.521e+02 2.818e+02 3.375e+02 4.819e+02, threshold=5.636e+02, percent-clipped=0.0 2023-06-23 19:29:51,140 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.98 vs. limit=22.5 2023-06-23 19:30:12,438 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=934218.0, ans=0.125 2023-06-23 19:30:46,860 INFO [train.py:996] (0/4) Epoch 6, batch 3250, loss[loss=0.1978, simple_loss=0.2468, pruned_loss=0.07439, over 20782.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3137, pruned_loss=0.07996, over 4283395.82 frames. ], batch size: 609, lr: 5.29e-03, grad_scale: 32.0 2023-06-23 19:32:41,486 INFO [train.py:996] (0/4) Epoch 6, batch 3300, loss[loss=0.2051, simple_loss=0.276, pruned_loss=0.06705, over 21628.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3071, pruned_loss=0.07956, over 4277944.81 frames. 
], batch size: 282, lr: 5.29e-03, grad_scale: 32.0 2023-06-23 19:32:44,815 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.77 vs. limit=15.0 2023-06-23 19:33:38,660 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.006e+02 2.620e+02 2.941e+02 3.334e+02 7.153e+02, threshold=5.881e+02, percent-clipped=1.0 2023-06-23 19:33:39,239 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=934758.0, ans=0.0 2023-06-23 19:33:53,898 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.26 vs. limit=22.5 2023-06-23 19:34:30,363 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=934878.0, ans=0.125 2023-06-23 19:34:33,182 INFO [train.py:996] (0/4) Epoch 6, batch 3350, loss[loss=0.2489, simple_loss=0.3169, pruned_loss=0.09042, over 21371.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3091, pruned_loss=0.07953, over 4272173.69 frames. ], batch size: 176, lr: 5.29e-03, grad_scale: 32.0 2023-06-23 19:34:45,247 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=934938.0, ans=0.0 2023-06-23 19:35:11,181 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=934998.0, ans=0.1 2023-06-23 19:35:12,961 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=934998.0, ans=0.125 2023-06-23 19:35:16,612 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=935058.0, ans=0.05 2023-06-23 19:35:36,145 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=935058.0, ans=0.125 2023-06-23 19:35:58,057 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=935118.0, ans=0.1 2023-06-23 19:36:25,733 INFO [train.py:996] (0/4) Epoch 6, batch 3400, loss[loss=0.2192, simple_loss=0.2925, pruned_loss=0.07293, over 21536.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3089, pruned_loss=0.08006, over 4273751.05 frames. ], batch size: 195, lr: 5.29e-03, grad_scale: 16.0 2023-06-23 19:37:15,618 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=935358.0, ans=0.125 2023-06-23 19:37:30,211 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.050e+02 2.631e+02 2.892e+02 3.496e+02 6.427e+02, threshold=5.784e+02, percent-clipped=1.0 2023-06-23 19:38:01,557 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 19:38:18,966 INFO [train.py:996] (0/4) Epoch 6, batch 3450, loss[loss=0.195, simple_loss=0.2528, pruned_loss=0.06864, over 21469.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3035, pruned_loss=0.07875, over 4271987.55 frames. ], batch size: 212, lr: 5.29e-03, grad_scale: 16.0 2023-06-23 19:39:30,718 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.72 vs. 
limit=12.0 2023-06-23 19:39:44,290 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=935718.0, ans=0.1 2023-06-23 19:40:16,018 INFO [train.py:996] (0/4) Epoch 6, batch 3500, loss[loss=0.2532, simple_loss=0.3304, pruned_loss=0.08802, over 21373.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3136, pruned_loss=0.08291, over 4277026.46 frames. ], batch size: 549, lr: 5.29e-03, grad_scale: 16.0 2023-06-23 19:40:41,207 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=935898.0, ans=0.125 2023-06-23 19:41:16,943 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.217e+02 2.777e+02 3.098e+02 3.671e+02 6.397e+02, threshold=6.196e+02, percent-clipped=1.0 2023-06-23 19:41:17,340 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-156000.pt 2023-06-23 19:41:21,312 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=935958.0, ans=0.125 2023-06-23 19:41:48,027 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=936078.0, ans=0.2 2023-06-23 19:41:56,564 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=936078.0, ans=0.125 2023-06-23 19:42:08,065 INFO [train.py:996] (0/4) Epoch 6, batch 3550, loss[loss=0.2144, simple_loss=0.283, pruned_loss=0.07286, over 21687.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3162, pruned_loss=0.08436, over 4281281.81 frames. ], batch size: 282, lr: 5.29e-03, grad_scale: 16.0 2023-06-23 19:42:45,799 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=936198.0, ans=0.0 2023-06-23 19:43:13,976 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=936258.0, ans=0.125 2023-06-23 19:43:24,501 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=936318.0, ans=0.125 2023-06-23 19:43:51,979 INFO [train.py:996] (0/4) Epoch 6, batch 3600, loss[loss=0.2474, simple_loss=0.3191, pruned_loss=0.08787, over 21854.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3111, pruned_loss=0.08323, over 4274901.41 frames. ], batch size: 118, lr: 5.28e-03, grad_scale: 32.0 2023-06-23 19:45:00,205 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.031e+02 2.636e+02 3.056e+02 3.547e+02 6.528e+02, threshold=6.113e+02, percent-clipped=1.0 2023-06-23 19:45:18,764 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=936618.0, ans=0.0 2023-06-23 19:45:48,094 INFO [train.py:996] (0/4) Epoch 6, batch 3650, loss[loss=0.2871, simple_loss=0.3691, pruned_loss=0.1026, over 21544.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3125, pruned_loss=0.08359, over 4273701.08 frames. ], batch size: 508, lr: 5.28e-03, grad_scale: 32.0 2023-06-23 19:45:57,595 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=936738.0, ans=0.125 2023-06-23 19:47:36,929 INFO [train.py:996] (0/4) Epoch 6, batch 3700, loss[loss=0.2272, simple_loss=0.3075, pruned_loss=0.07343, over 21792.00 frames. 
], tot_loss[loss=0.2372, simple_loss=0.3101, pruned_loss=0.08216, over 4285382.59 frames. ], batch size: 247, lr: 5.28e-03, grad_scale: 32.0 2023-06-23 19:48:38,105 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.081e+02 2.573e+02 2.941e+02 3.537e+02 5.018e+02, threshold=5.882e+02, percent-clipped=0.0 2023-06-23 19:48:42,992 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.13 vs. limit=22.5 2023-06-23 19:49:27,061 INFO [train.py:996] (0/4) Epoch 6, batch 3750, loss[loss=0.2136, simple_loss=0.2832, pruned_loss=0.07203, over 21845.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3091, pruned_loss=0.08157, over 4286100.95 frames. ], batch size: 107, lr: 5.28e-03, grad_scale: 32.0 2023-06-23 19:49:35,191 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=937338.0, ans=0.125 2023-06-23 19:49:50,700 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=937338.0, ans=0.1 2023-06-23 19:50:10,689 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.23 vs. limit=15.0 2023-06-23 19:50:14,806 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.64 vs. limit=22.5 2023-06-23 19:51:03,653 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=937578.0, ans=0.0 2023-06-23 19:51:29,606 INFO [train.py:996] (0/4) Epoch 6, batch 3800, loss[loss=0.2121, simple_loss=0.2889, pruned_loss=0.06768, over 21118.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3067, pruned_loss=0.07961, over 4284533.78 frames. ], batch size: 608, lr: 5.28e-03, grad_scale: 16.0 2023-06-23 19:51:46,054 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=937638.0, ans=0.2 2023-06-23 19:51:56,333 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 19:52:20,712 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=937758.0, ans=0.2 2023-06-23 19:52:21,711 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.562e+02 2.479e+02 2.831e+02 3.335e+02 6.491e+02, threshold=5.662e+02, percent-clipped=1.0 2023-06-23 19:52:48,403 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=937818.0, ans=0.125 2023-06-23 19:53:10,876 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=937878.0, ans=0.125 2023-06-23 19:53:12,377 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=937878.0, ans=0.0 2023-06-23 19:53:20,151 INFO [train.py:996] (0/4) Epoch 6, batch 3850, loss[loss=0.1952, simple_loss=0.2586, pruned_loss=0.06587, over 21599.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3042, pruned_loss=0.07939, over 4289343.32 frames. ], batch size: 298, lr: 5.28e-03, grad_scale: 16.0 2023-06-23 19:53:35,173 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.42 vs. 
limit=15.0 2023-06-23 19:54:00,925 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=938058.0, ans=0.5 2023-06-23 19:54:04,239 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=938058.0, ans=0.0 2023-06-23 19:55:09,883 INFO [train.py:996] (0/4) Epoch 6, batch 3900, loss[loss=0.2167, simple_loss=0.2819, pruned_loss=0.0757, over 21847.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.2995, pruned_loss=0.07898, over 4288524.64 frames. ], batch size: 371, lr: 5.28e-03, grad_scale: 16.0 2023-06-23 19:55:21,311 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=938238.0, ans=0.125 2023-06-23 19:55:21,364 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 19:55:25,678 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.40 vs. limit=10.0 2023-06-23 19:56:02,984 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.187e+02 2.781e+02 3.101e+02 3.883e+02 8.958e+02, threshold=6.202e+02, percent-clipped=3.0 2023-06-23 19:56:05,360 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=938358.0, ans=0.125 2023-06-23 19:56:13,498 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.06 vs. limit=12.0 2023-06-23 19:56:36,556 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=938478.0, ans=0.125 2023-06-23 19:57:05,036 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=938538.0, ans=0.125 2023-06-23 19:57:06,057 INFO [train.py:996] (0/4) Epoch 6, batch 3950, loss[loss=0.1865, simple_loss=0.2632, pruned_loss=0.0549, over 21138.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3013, pruned_loss=0.07854, over 4291230.09 frames. ], batch size: 143, lr: 5.28e-03, grad_scale: 16.0 2023-06-23 19:57:29,489 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=938598.0, ans=0.0 2023-06-23 19:57:37,882 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=938598.0, ans=0.125 2023-06-23 19:57:54,818 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.76 vs. limit=15.0 2023-06-23 19:57:56,454 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.13 vs. limit=15.0 2023-06-23 19:58:26,374 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=938718.0, ans=0.05 2023-06-23 19:58:30,157 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.03 vs. 
limit=15.0 2023-06-23 19:58:53,499 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=938778.0, ans=0.04949747468305833 2023-06-23 19:58:56,515 INFO [train.py:996] (0/4) Epoch 6, batch 4000, loss[loss=0.1909, simple_loss=0.2583, pruned_loss=0.06176, over 21778.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.296, pruned_loss=0.0754, over 4283798.53 frames. ], batch size: 351, lr: 5.28e-03, grad_scale: 32.0 2023-06-23 19:59:36,129 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=938958.0, ans=0.1 2023-06-23 19:59:43,445 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=938958.0, ans=0.0 2023-06-23 19:59:44,420 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.908e+02 2.407e+02 2.711e+02 3.233e+02 5.039e+02, threshold=5.423e+02, percent-clipped=0.0 2023-06-23 20:00:11,412 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=939018.0, ans=0.125 2023-06-23 20:00:15,023 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=939018.0, ans=0.125 2023-06-23 20:00:22,109 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=939078.0, ans=0.0 2023-06-23 20:00:24,003 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 20:00:24,635 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.12 vs. limit=15.0 2023-06-23 20:00:47,418 INFO [train.py:996] (0/4) Epoch 6, batch 4050, loss[loss=0.2536, simple_loss=0.3592, pruned_loss=0.074, over 21271.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2946, pruned_loss=0.07436, over 4286255.25 frames. ], batch size: 548, lr: 5.28e-03, grad_scale: 32.0 2023-06-23 20:01:01,732 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=939138.0, ans=0.0 2023-06-23 20:01:03,562 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=939198.0, ans=0.0 2023-06-23 20:01:20,679 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.55 vs. limit=22.5 2023-06-23 20:02:32,567 INFO [train.py:996] (0/4) Epoch 6, batch 4100, loss[loss=0.2333, simple_loss=0.3161, pruned_loss=0.07529, over 21707.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2958, pruned_loss=0.07442, over 4290048.88 frames. ], batch size: 389, lr: 5.28e-03, grad_scale: 32.0 2023-06-23 20:02:58,571 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.24 vs. 
limit=12.0 2023-06-23 20:03:26,713 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.792e+02 2.413e+02 2.658e+02 3.099e+02 5.779e+02, threshold=5.316e+02, percent-clipped=1.0 2023-06-23 20:03:27,508 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=939558.0, ans=0.0 2023-06-23 20:04:00,067 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=939678.0, ans=0.2 2023-06-23 20:04:07,171 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=939678.0, ans=0.125 2023-06-23 20:04:18,588 INFO [train.py:996] (0/4) Epoch 6, batch 4150, loss[loss=0.1794, simple_loss=0.2755, pruned_loss=0.0417, over 21148.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2963, pruned_loss=0.07222, over 4284997.32 frames. ], batch size: 159, lr: 5.28e-03, grad_scale: 32.0 2023-06-23 20:04:21,027 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=939738.0, ans=0.035 2023-06-23 20:05:07,359 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=939858.0, ans=0.125 2023-06-23 20:06:00,137 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=939978.0, ans=0.125 2023-06-23 20:06:12,067 INFO [train.py:996] (0/4) Epoch 6, batch 4200, loss[loss=0.2622, simple_loss=0.3224, pruned_loss=0.101, over 21450.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2964, pruned_loss=0.07091, over 4279129.61 frames. ], batch size: 473, lr: 5.27e-03, grad_scale: 32.0 2023-06-23 20:06:27,924 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.22 vs. limit=6.0 2023-06-23 20:06:38,565 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=940098.0, ans=0.0 2023-06-23 20:07:08,952 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.98 vs. limit=15.0 2023-06-23 20:07:18,284 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.688e+02 2.286e+02 2.656e+02 3.507e+02 6.693e+02, threshold=5.313e+02, percent-clipped=3.0 2023-06-23 20:07:38,671 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=940218.0, ans=0.1 2023-06-23 20:07:46,434 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=940278.0, ans=0.125 2023-06-23 20:08:05,650 INFO [train.py:996] (0/4) Epoch 6, batch 4250, loss[loss=0.2495, simple_loss=0.3468, pruned_loss=0.0761, over 21854.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.3024, pruned_loss=0.07331, over 4273611.64 frames. ], batch size: 317, lr: 5.27e-03, grad_scale: 16.0 2023-06-23 20:08:06,850 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.19 vs. 
limit=12.0 2023-06-23 20:08:28,967 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=940338.0, ans=0.125 2023-06-23 20:08:43,367 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=940398.0, ans=0.0 2023-06-23 20:09:05,538 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.90 vs. limit=15.0 2023-06-23 20:09:21,145 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=940518.0, ans=0.125 2023-06-23 20:09:30,361 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=940518.0, ans=0.0 2023-06-23 20:09:59,056 INFO [train.py:996] (0/4) Epoch 6, batch 4300, loss[loss=0.2195, simple_loss=0.3119, pruned_loss=0.06352, over 21830.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3106, pruned_loss=0.07616, over 4270342.99 frames. ], batch size: 282, lr: 5.27e-03, grad_scale: 16.0 2023-06-23 20:10:18,634 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=940638.0, ans=0.125 2023-06-23 20:10:32,264 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.14 vs. limit=15.0 2023-06-23 20:11:11,082 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.144e+02 2.724e+02 3.223e+02 4.213e+02 6.998e+02, threshold=6.446e+02, percent-clipped=6.0 2023-06-23 20:11:29,089 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=940818.0, ans=0.07 2023-06-23 20:11:45,087 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=940878.0, ans=0.1 2023-06-23 20:12:00,247 INFO [train.py:996] (0/4) Epoch 6, batch 4350, loss[loss=0.2418, simple_loss=0.3553, pruned_loss=0.06414, over 21271.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3109, pruned_loss=0.0757, over 4266525.54 frames. ], batch size: 548, lr: 5.27e-03, grad_scale: 16.0 2023-06-23 20:12:03,429 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.41 vs. limit=15.0 2023-06-23 20:12:20,060 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=940998.0, ans=0.125 2023-06-23 20:13:18,933 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=941118.0, ans=0.0 2023-06-23 20:13:51,157 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=941238.0, ans=0.1 2023-06-23 20:13:51,993 INFO [train.py:996] (0/4) Epoch 6, batch 4400, loss[loss=0.2475, simple_loss=0.3711, pruned_loss=0.06192, over 19859.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3066, pruned_loss=0.07505, over 4256519.61 frames. ], batch size: 702, lr: 5.27e-03, grad_scale: 32.0 2023-06-23 20:14:02,200 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.19 vs. 
limit=15.0 2023-06-23 20:14:40,343 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.86 vs. limit=22.5 2023-06-23 20:14:50,472 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=941358.0, ans=0.0 2023-06-23 20:14:53,320 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.950e+02 2.531e+02 2.865e+02 3.462e+02 7.210e+02, threshold=5.730e+02, percent-clipped=2.0 2023-06-23 20:14:56,461 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.66 vs. limit=15.0 2023-06-23 20:15:43,305 INFO [train.py:996] (0/4) Epoch 6, batch 4450, loss[loss=0.2305, simple_loss=0.3028, pruned_loss=0.0791, over 21449.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3142, pruned_loss=0.07672, over 4267223.56 frames. ], batch size: 131, lr: 5.27e-03, grad_scale: 32.0 2023-06-23 20:16:34,567 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=941658.0, ans=0.125 2023-06-23 20:16:48,435 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=941718.0, ans=0.2 2023-06-23 20:17:01,123 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.75 vs. limit=12.0 2023-06-23 20:17:38,978 INFO [train.py:996] (0/4) Epoch 6, batch 4500, loss[loss=0.2578, simple_loss=0.3451, pruned_loss=0.08528, over 21893.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3158, pruned_loss=0.07902, over 4276425.18 frames. ], batch size: 371, lr: 5.27e-03, grad_scale: 32.0 2023-06-23 20:17:55,919 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=941838.0, ans=0.125 2023-06-23 20:18:09,463 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.19 vs. limit=22.5 2023-06-23 20:18:14,212 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=941898.0, ans=0.125 2023-06-23 20:18:17,740 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 20:18:32,694 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.004e+02 2.439e+02 2.793e+02 3.421e+02 5.110e+02, threshold=5.586e+02, percent-clipped=0.0 2023-06-23 20:19:34,556 INFO [train.py:996] (0/4) Epoch 6, batch 4550, loss[loss=0.255, simple_loss=0.328, pruned_loss=0.091, over 21314.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3185, pruned_loss=0.07985, over 4275116.53 frames. 
], batch size: 548, lr: 5.27e-03, grad_scale: 32.0 2023-06-23 20:19:38,696 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=942138.0, ans=0.125 2023-06-23 20:20:22,342 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=942258.0, ans=0.125 2023-06-23 20:20:36,805 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=942258.0, ans=0.125 2023-06-23 20:21:12,636 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.03 vs. limit=6.0 2023-06-23 20:21:21,714 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.06 vs. limit=15.0 2023-06-23 20:21:22,613 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=942378.0, ans=0.0 2023-06-23 20:21:25,296 INFO [train.py:996] (0/4) Epoch 6, batch 4600, loss[loss=0.1977, simple_loss=0.2771, pruned_loss=0.0592, over 21656.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.32, pruned_loss=0.08095, over 4274057.18 frames. ], batch size: 230, lr: 5.27e-03, grad_scale: 32.0 2023-06-23 20:21:55,803 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=942498.0, ans=0.125 2023-06-23 20:22:25,595 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.170e+02 2.585e+02 3.169e+02 3.580e+02 7.815e+02, threshold=6.337e+02, percent-clipped=3.0 2023-06-23 20:22:33,123 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=942618.0, ans=0.05 2023-06-23 20:22:36,549 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=942618.0, ans=0.1 2023-06-23 20:22:37,236 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.21 vs. limit=12.0 2023-06-23 20:22:45,779 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=942618.0, ans=0.2 2023-06-23 20:22:51,591 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.06 vs. limit=15.0 2023-06-23 20:23:13,750 INFO [train.py:996] (0/4) Epoch 6, batch 4650, loss[loss=0.1775, simple_loss=0.2449, pruned_loss=0.05502, over 21251.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3138, pruned_loss=0.07969, over 4269924.74 frames. ], batch size: 159, lr: 5.27e-03, grad_scale: 32.0 2023-06-23 20:23:14,584 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=942738.0, ans=0.125 2023-06-23 20:23:16,215 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 20:23:50,889 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.14 vs. 
limit=15.0 2023-06-23 20:24:00,521 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=942858.0, ans=0.1 2023-06-23 20:24:07,653 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=942858.0, ans=0.125 2023-06-23 20:24:19,992 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=942918.0, ans=0.2 2023-06-23 20:24:25,501 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=942918.0, ans=0.2 2023-06-23 20:24:55,392 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=942978.0, ans=0.125 2023-06-23 20:25:03,301 INFO [train.py:996] (0/4) Epoch 6, batch 4700, loss[loss=0.1961, simple_loss=0.2624, pruned_loss=0.06493, over 21664.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3039, pruned_loss=0.07712, over 4270517.69 frames. ], batch size: 282, lr: 5.27e-03, grad_scale: 16.0 2023-06-23 20:25:11,517 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.21 vs. limit=6.0 2023-06-23 20:26:04,157 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.777e+02 2.385e+02 2.698e+02 3.095e+02 5.090e+02, threshold=5.395e+02, percent-clipped=0.0 2023-06-23 20:26:36,118 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=943278.0, ans=22.5 2023-06-23 20:26:50,579 INFO [train.py:996] (0/4) Epoch 6, batch 4750, loss[loss=0.2429, simple_loss=0.2936, pruned_loss=0.09616, over 20252.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.298, pruned_loss=0.07714, over 4270483.19 frames. ], batch size: 707, lr: 5.27e-03, grad_scale: 16.0 2023-06-23 20:27:01,833 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=943338.0, ans=0.07 2023-06-23 20:27:29,647 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=943458.0, ans=0.2 2023-06-23 20:28:39,518 INFO [train.py:996] (0/4) Epoch 6, batch 4800, loss[loss=0.2213, simple_loss=0.3304, pruned_loss=0.05615, over 19793.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.2979, pruned_loss=0.07737, over 4275427.81 frames. ], batch size: 703, lr: 5.26e-03, grad_scale: 32.0 2023-06-23 20:29:09,190 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=943698.0, ans=0.125 2023-06-23 20:29:28,021 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=943758.0, ans=0.125 2023-06-23 20:29:42,948 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.177e+02 2.734e+02 3.125e+02 3.511e+02 5.007e+02, threshold=6.249e+02, percent-clipped=0.0 2023-06-23 20:29:57,134 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=943818.0, ans=0.125 2023-06-23 20:29:59,790 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.22 vs. 
limit=6.0 2023-06-23 20:30:24,283 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=943878.0, ans=0.07 2023-06-23 20:30:27,086 INFO [train.py:996] (0/4) Epoch 6, batch 4850, loss[loss=0.1923, simple_loss=0.2738, pruned_loss=0.05542, over 21657.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.2976, pruned_loss=0.07632, over 4277037.00 frames. ], batch size: 298, lr: 5.26e-03, grad_scale: 16.0 2023-06-23 20:30:41,721 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=943938.0, ans=0.05 2023-06-23 20:31:12,921 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=944058.0, ans=0.125 2023-06-23 20:31:34,251 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=944118.0, ans=0.1 2023-06-23 20:32:17,568 INFO [train.py:996] (0/4) Epoch 6, batch 4900, loss[loss=0.1894, simple_loss=0.2714, pruned_loss=0.05372, over 20108.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.2983, pruned_loss=0.07705, over 4274240.35 frames. ], batch size: 703, lr: 5.26e-03, grad_scale: 16.0 2023-06-23 20:32:54,999 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=944298.0, ans=0.1 2023-06-23 20:33:28,631 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.946e+02 2.472e+02 2.764e+02 3.016e+02 5.453e+02, threshold=5.528e+02, percent-clipped=0.0 2023-06-23 20:34:09,905 INFO [train.py:996] (0/4) Epoch 6, batch 4950, loss[loss=0.1801, simple_loss=0.2635, pruned_loss=0.04834, over 21314.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3025, pruned_loss=0.07491, over 4275867.07 frames. ], batch size: 211, lr: 5.26e-03, grad_scale: 16.0 2023-06-23 20:35:33,485 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.60 vs. limit=15.0 2023-06-23 20:35:43,702 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=944778.0, ans=0.125 2023-06-23 20:35:58,238 INFO [train.py:996] (0/4) Epoch 6, batch 5000, loss[loss=0.2094, simple_loss=0.285, pruned_loss=0.06693, over 21455.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.3022, pruned_loss=0.07217, over 4279043.52 frames. ], batch size: 194, lr: 5.26e-03, grad_scale: 16.0 2023-06-23 20:36:29,010 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=944898.0, ans=0.125 2023-06-23 20:37:01,265 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.831e+02 2.469e+02 2.951e+02 3.464e+02 5.172e+02, threshold=5.903e+02, percent-clipped=0.0 2023-06-23 20:37:37,380 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=945078.0, ans=0.0 2023-06-23 20:37:40,310 INFO [train.py:996] (0/4) Epoch 6, batch 5050, loss[loss=0.221, simple_loss=0.2929, pruned_loss=0.07454, over 21933.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.3024, pruned_loss=0.07462, over 4291084.94 frames. ], batch size: 333, lr: 5.26e-03, grad_scale: 16.0 2023-06-23 20:38:24,738 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.06 vs. 
limit=15.0 2023-06-23 20:38:27,869 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=945258.0, ans=0.025 2023-06-23 20:38:31,791 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.90 vs. limit=15.0 2023-06-23 20:39:09,836 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=945378.0, ans=0.0 2023-06-23 20:39:22,254 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=945378.0, ans=0.1 2023-06-23 20:39:26,520 INFO [train.py:996] (0/4) Epoch 6, batch 5100, loss[loss=0.2116, simple_loss=0.2848, pruned_loss=0.06914, over 21861.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3014, pruned_loss=0.07464, over 4285776.85 frames. ], batch size: 124, lr: 5.26e-03, grad_scale: 16.0 2023-06-23 20:40:20,264 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=945558.0, ans=0.0 2023-06-23 20:40:28,033 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.43 vs. limit=15.0 2023-06-23 20:40:28,073 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.11 vs. limit=15.0 2023-06-23 20:40:30,252 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.074e+02 2.802e+02 3.209e+02 3.785e+02 5.711e+02, threshold=6.418e+02, percent-clipped=0.0 2023-06-23 20:40:42,967 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=945618.0, ans=0.125 2023-06-23 20:41:09,272 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=945678.0, ans=0.125 2023-06-23 20:41:15,796 INFO [train.py:996] (0/4) Epoch 6, batch 5150, loss[loss=0.247, simple_loss=0.3244, pruned_loss=0.08483, over 21400.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3023, pruned_loss=0.07581, over 4289213.27 frames. ], batch size: 548, lr: 5.26e-03, grad_scale: 16.0 2023-06-23 20:41:27,987 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.78 vs. limit=22.5 2023-06-23 20:41:45,030 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=945798.0, ans=0.125 2023-06-23 20:42:15,537 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=945858.0, ans=0.2 2023-06-23 20:43:01,147 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=945978.0, ans=0.07 2023-06-23 20:43:02,835 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=945978.0, ans=0.0 2023-06-23 20:43:05,742 INFO [train.py:996] (0/4) Epoch 6, batch 5200, loss[loss=0.2006, simple_loss=0.2618, pruned_loss=0.06965, over 21335.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3031, pruned_loss=0.07661, over 4286283.13 frames. 
], batch size: 176, lr: 5.26e-03, grad_scale: 32.0 2023-06-23 20:43:37,499 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.84 vs. limit=12.0 2023-06-23 20:44:01,305 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=946158.0, ans=0.125 2023-06-23 20:44:14,594 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.120e+02 2.657e+02 3.031e+02 3.767e+02 5.750e+02, threshold=6.062e+02, percent-clipped=0.0 2023-06-23 20:44:23,255 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=946218.0, ans=0.0 2023-06-23 20:44:26,608 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=946218.0, ans=0.1 2023-06-23 20:44:34,057 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.15 vs. limit=22.5 2023-06-23 20:44:42,286 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=946278.0, ans=0.125 2023-06-23 20:44:59,550 INFO [train.py:996] (0/4) Epoch 6, batch 5250, loss[loss=0.171, simple_loss=0.2388, pruned_loss=0.05157, over 21821.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3069, pruned_loss=0.07565, over 4284267.07 frames. ], batch size: 102, lr: 5.26e-03, grad_scale: 32.0 2023-06-23 20:45:11,037 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=946338.0, ans=0.2 2023-06-23 20:45:17,916 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=946338.0, ans=0.125 2023-06-23 20:46:52,859 INFO [train.py:996] (0/4) Epoch 6, batch 5300, loss[loss=0.2338, simple_loss=0.3002, pruned_loss=0.08373, over 21893.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3061, pruned_loss=0.07685, over 4289844.31 frames. ], batch size: 351, lr: 5.26e-03, grad_scale: 32.0 2023-06-23 20:47:20,141 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.91 vs. limit=15.0 2023-06-23 20:47:20,201 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=946698.0, ans=15.0 2023-06-23 20:47:49,575 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.34 vs. limit=15.0 2023-06-23 20:47:55,263 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.995e+02 2.539e+02 2.781e+02 3.236e+02 4.836e+02, threshold=5.563e+02, percent-clipped=0.0 2023-06-23 20:47:58,303 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.89 vs. limit=22.5 2023-06-23 20:47:58,370 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.53 vs. 
limit=15.0 2023-06-23 20:48:04,826 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=946818.0, ans=0.125 2023-06-23 20:48:41,803 INFO [train.py:996] (0/4) Epoch 6, batch 5350, loss[loss=0.2195, simple_loss=0.2879, pruned_loss=0.07557, over 21903.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3048, pruned_loss=0.07814, over 4294804.39 frames. ], batch size: 316, lr: 5.26e-03, grad_scale: 32.0 2023-06-23 20:48:44,193 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=946938.0, ans=0.125 2023-06-23 20:48:56,205 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=946938.0, ans=0.1 2023-06-23 20:49:07,440 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.05 vs. limit=8.0 2023-06-23 20:50:25,017 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=947178.0, ans=0.1 2023-06-23 20:50:29,939 INFO [train.py:996] (0/4) Epoch 6, batch 5400, loss[loss=0.2031, simple_loss=0.2657, pruned_loss=0.07026, over 21651.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3048, pruned_loss=0.07823, over 4283663.36 frames. ], batch size: 263, lr: 5.25e-03, grad_scale: 16.0 2023-06-23 20:50:34,657 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.65 vs. limit=15.0 2023-06-23 20:51:10,886 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=947358.0, ans=0.0 2023-06-23 20:51:33,295 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=947418.0, ans=0.1 2023-06-23 20:51:34,443 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.034e+02 2.654e+02 3.257e+02 3.898e+02 6.722e+02, threshold=6.513e+02, percent-clipped=2.0 2023-06-23 20:51:36,877 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=947418.0, ans=0.04949747468305833 2023-06-23 20:52:11,273 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=947478.0, ans=0.125 2023-06-23 20:52:19,513 INFO [train.py:996] (0/4) Epoch 6, batch 5450, loss[loss=0.2138, simple_loss=0.2795, pruned_loss=0.07407, over 21181.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3042, pruned_loss=0.07625, over 4281732.75 frames. ], batch size: 608, lr: 5.25e-03, grad_scale: 16.0 2023-06-23 20:53:13,018 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=947658.0, ans=0.125 2023-06-23 20:54:09,252 INFO [train.py:996] (0/4) Epoch 6, batch 5500, loss[loss=0.198, simple_loss=0.2982, pruned_loss=0.04895, over 21658.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3091, pruned_loss=0.0734, over 4279133.58 frames. ], batch size: 263, lr: 5.25e-03, grad_scale: 16.0 2023-06-23 20:54:20,623 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=947838.0, ans=0.2 2023-06-23 20:54:29,752 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.22 vs. 
limit=15.0 2023-06-23 20:54:57,167 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.82 vs. limit=15.0 2023-06-23 20:55:03,584 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=947958.0, ans=0.125 2023-06-23 20:55:24,834 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.705e+02 2.255e+02 2.654e+02 3.007e+02 4.668e+02, threshold=5.308e+02, percent-clipped=0.0 2023-06-23 20:55:29,674 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.57 vs. limit=22.5 2023-06-23 20:56:04,053 INFO [train.py:996] (0/4) Epoch 6, batch 5550, loss[loss=0.2166, simple_loss=0.2905, pruned_loss=0.07133, over 21016.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.3076, pruned_loss=0.06961, over 4272008.10 frames. ], batch size: 607, lr: 5.25e-03, grad_scale: 16.0 2023-06-23 20:56:39,319 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=948198.0, ans=0.1 2023-06-23 20:56:39,365 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=948198.0, ans=0.125 2023-06-23 20:57:01,268 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=948258.0, ans=0.025 2023-06-23 20:57:05,281 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.78 vs. limit=12.0 2023-06-23 20:57:16,100 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=948318.0, ans=0.1 2023-06-23 20:57:39,466 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.39 vs. limit=22.5 2023-06-23 20:57:56,421 INFO [train.py:996] (0/4) Epoch 6, batch 5600, loss[loss=0.2023, simple_loss=0.2812, pruned_loss=0.06165, over 21149.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.3036, pruned_loss=0.06651, over 4278984.24 frames. ], batch size: 143, lr: 5.25e-03, grad_scale: 32.0 2023-06-23 20:58:14,938 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=948498.0, ans=0.125 2023-06-23 20:58:15,006 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=948498.0, ans=0.125 2023-06-23 20:58:25,114 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_na.min_abs, batch_count=948498.0, ans=0.02 2023-06-23 20:58:33,374 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.57 vs. limit=6.0 2023-06-23 20:58:47,025 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.43 vs. 
limit=15.0 2023-06-23 20:58:48,423 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 20:59:01,214 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.636e+02 2.332e+02 2.800e+02 3.364e+02 5.770e+02, threshold=5.601e+02, percent-clipped=3.0 2023-06-23 20:59:12,478 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=948618.0, ans=0.125 2023-06-23 20:59:44,388 INFO [train.py:996] (0/4) Epoch 6, batch 5650, loss[loss=0.2349, simple_loss=0.3073, pruned_loss=0.08118, over 21888.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.3073, pruned_loss=0.06843, over 4274985.64 frames. ], batch size: 351, lr: 5.25e-03, grad_scale: 32.0 2023-06-23 21:00:26,820 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.60 vs. limit=15.0 2023-06-23 21:00:39,046 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=948858.0, ans=0.0 2023-06-23 21:01:29,437 INFO [train.py:996] (0/4) Epoch 6, batch 5700, loss[loss=0.2464, simple_loss=0.3074, pruned_loss=0.09273, over 21607.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.3087, pruned_loss=0.07103, over 4275456.23 frames. ], batch size: 548, lr: 5.25e-03, grad_scale: 32.0 2023-06-23 21:02:41,758 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.056e+02 2.515e+02 2.975e+02 3.453e+02 5.794e+02, threshold=5.950e+02, percent-clipped=1.0 2023-06-23 21:02:55,064 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=949218.0, ans=0.1 2023-06-23 21:03:02,650 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.70 vs. limit=22.5 2023-06-23 21:03:31,998 INFO [train.py:996] (0/4) Epoch 6, batch 5750, loss[loss=0.1785, simple_loss=0.271, pruned_loss=0.04299, over 21778.00 frames. ], tot_loss[loss=0.223, simple_loss=0.3057, pruned_loss=0.07012, over 4282032.71 frames. ], batch size: 332, lr: 5.25e-03, grad_scale: 32.0 2023-06-23 21:03:40,624 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.85 vs. limit=12.0 2023-06-23 21:05:12,854 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=949578.0, ans=0.0 2023-06-23 21:05:22,457 INFO [train.py:996] (0/4) Epoch 6, batch 5800, loss[loss=0.329, simple_loss=0.4044, pruned_loss=0.1268, over 21501.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.3051, pruned_loss=0.06854, over 4283499.97 frames. ], batch size: 508, lr: 5.25e-03, grad_scale: 32.0 2023-06-23 21:06:23,884 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.05 vs. limit=22.5 2023-06-23 21:06:27,701 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.707e+02 2.304e+02 2.799e+02 4.068e+02 6.558e+02, threshold=5.598e+02, percent-clipped=2.0 2023-06-23 21:06:33,718 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=949818.0, ans=0.0 2023-06-23 21:07:04,955 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.78 vs. 
limit=15.0 2023-06-23 21:07:11,457 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=949938.0, ans=0.125 2023-06-23 21:07:12,470 INFO [train.py:996] (0/4) Epoch 6, batch 5850, loss[loss=0.1607, simple_loss=0.237, pruned_loss=0.04217, over 21900.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.3032, pruned_loss=0.06487, over 4287118.37 frames. ], batch size: 107, lr: 5.25e-03, grad_scale: 32.0 2023-06-23 21:07:48,851 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.41 vs. limit=6.0 2023-06-23 21:08:32,084 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 21:08:55,269 INFO [train.py:996] (0/4) Epoch 6, batch 5900, loss[loss=0.1914, simple_loss=0.2678, pruned_loss=0.05752, over 21779.00 frames. ], tot_loss[loss=0.2085, simple_loss=0.2964, pruned_loss=0.06029, over 4280987.76 frames. ], batch size: 298, lr: 5.25e-03, grad_scale: 32.0 2023-06-23 21:09:02,752 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=950238.0, ans=0.125 2023-06-23 21:09:14,868 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=950298.0, ans=0.1 2023-06-23 21:09:46,703 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=950358.0, ans=0.0 2023-06-23 21:09:57,803 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.381e+02 1.988e+02 2.407e+02 3.041e+02 4.833e+02, threshold=4.814e+02, percent-clipped=0.0 2023-06-23 21:10:41,853 INFO [train.py:996] (0/4) Epoch 6, batch 5950, loss[loss=0.2086, simple_loss=0.276, pruned_loss=0.07055, over 21746.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2955, pruned_loss=0.06399, over 4278768.47 frames. ], batch size: 333, lr: 5.25e-03, grad_scale: 32.0 2023-06-23 21:10:48,140 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.32 vs. limit=6.0 2023-06-23 21:11:45,673 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=950718.0, ans=0.05 2023-06-23 21:12:02,715 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=950718.0, ans=0.2 2023-06-23 21:12:30,062 INFO [train.py:996] (0/4) Epoch 6, batch 6000, loss[loss=0.1861, simple_loss=0.2527, pruned_loss=0.05968, over 21753.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2908, pruned_loss=0.0672, over 4285071.16 frames. ], batch size: 112, lr: 5.24e-03, grad_scale: 32.0 2023-06-23 21:12:30,064 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-23 21:12:53,044 INFO [train.py:1028] (0/4) Epoch 6, validation: loss=0.2596, simple_loss=0.3528, pruned_loss=0.08322, over 1796401.00 frames. 2023-06-23 21:12:53,045 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23616MB 2023-06-23 21:12:53,609 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=950838.0, ans=0.125 2023-06-23 21:12:57,648 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.01 vs. 
limit=15.0 2023-06-23 21:13:02,752 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=950838.0, ans=0.125 2023-06-23 21:13:30,423 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=950898.0, ans=0.0 2023-06-23 21:13:31,008 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.38 vs. limit=22.5 2023-06-23 21:14:03,946 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.015e+02 2.620e+02 2.865e+02 3.269e+02 5.211e+02, threshold=5.729e+02, percent-clipped=1.0 2023-06-23 21:14:26,591 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.09 vs. limit=15.0 2023-06-23 21:14:48,486 INFO [train.py:996] (0/4) Epoch 6, batch 6050, loss[loss=0.1654, simple_loss=0.2438, pruned_loss=0.04349, over 21696.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2855, pruned_loss=0.06772, over 4278807.25 frames. ], batch size: 247, lr: 5.24e-03, grad_scale: 32.0 2023-06-23 21:14:50,812 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=951138.0, ans=0.125 2023-06-23 21:15:13,454 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff2.min_abs, batch_count=951198.0, ans=0.1 2023-06-23 21:15:37,979 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=951258.0, ans=0.0 2023-06-23 21:15:51,485 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=951318.0, ans=0.0 2023-06-23 21:15:51,488 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=951318.0, ans=0.125 2023-06-23 21:16:18,037 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=951378.0, ans=0.0 2023-06-23 21:16:30,420 INFO [train.py:996] (0/4) Epoch 6, batch 6100, loss[loss=0.1877, simple_loss=0.2844, pruned_loss=0.04549, over 21803.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2833, pruned_loss=0.06646, over 4282692.61 frames. ], batch size: 371, lr: 5.24e-03, grad_scale: 32.0 2023-06-23 21:16:53,833 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=951498.0, ans=0.125 2023-06-23 21:17:29,554 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=951558.0, ans=0.0 2023-06-23 21:17:40,846 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.589e+02 2.204e+02 2.422e+02 2.717e+02 3.811e+02, threshold=4.844e+02, percent-clipped=0.0 2023-06-23 21:18:18,538 INFO [train.py:996] (0/4) Epoch 6, batch 6150, loss[loss=0.2004, simple_loss=0.2718, pruned_loss=0.06455, over 21527.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2867, pruned_loss=0.06967, over 4285975.11 frames. 
], batch size: 195, lr: 5.24e-03, grad_scale: 32.0 2023-06-23 21:18:31,278 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=951738.0, ans=0.2 2023-06-23 21:18:36,719 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=951798.0, ans=0.125 2023-06-23 21:18:39,162 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.68 vs. limit=15.0 2023-06-23 21:20:00,408 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.71 vs. limit=15.0 2023-06-23 21:20:08,113 INFO [train.py:996] (0/4) Epoch 6, batch 6200, loss[loss=0.2116, simple_loss=0.2858, pruned_loss=0.06867, over 21381.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2892, pruned_loss=0.06905, over 4277167.59 frames. ], batch size: 159, lr: 5.24e-03, grad_scale: 16.0 2023-06-23 21:21:10,000 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.80 vs. limit=15.0 2023-06-23 21:21:15,541 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.781e+02 2.446e+02 2.781e+02 3.201e+02 6.151e+02, threshold=5.562e+02, percent-clipped=2.0 2023-06-23 21:21:18,265 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=952218.0, ans=0.0 2023-06-23 21:21:24,109 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=952218.0, ans=0.1 2023-06-23 21:21:58,214 INFO [train.py:996] (0/4) Epoch 6, batch 6250, loss[loss=0.2177, simple_loss=0.3207, pruned_loss=0.05735, over 21784.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2947, pruned_loss=0.06794, over 4273342.17 frames. ], batch size: 332, lr: 5.24e-03, grad_scale: 16.0 2023-06-23 21:22:04,651 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.42 vs. limit=15.0 2023-06-23 21:22:11,112 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=952338.0, ans=0.0 2023-06-23 21:22:11,896 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.80 vs. limit=15.0 2023-06-23 21:22:16,600 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=952398.0, ans=0.0 2023-06-23 21:22:34,482 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.79 vs. limit=22.5 2023-06-23 21:22:50,824 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=952458.0, ans=0.125 2023-06-23 21:23:45,352 INFO [train.py:996] (0/4) Epoch 6, batch 6300, loss[loss=0.2834, simple_loss=0.4037, pruned_loss=0.08149, over 20816.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.299, pruned_loss=0.06816, over 4267915.19 frames. 
], batch size: 607, lr: 5.24e-03, grad_scale: 16.0 2023-06-23 21:23:56,697 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=952638.0, ans=0.2 2023-06-23 21:24:08,455 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.10 vs. limit=15.0 2023-06-23 21:24:26,714 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=952698.0, ans=0.125 2023-06-23 21:24:37,477 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=952758.0, ans=0.125 2023-06-23 21:24:57,676 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.792e+02 2.558e+02 3.046e+02 3.782e+02 6.709e+02, threshold=6.092e+02, percent-clipped=4.0 2023-06-23 21:25:18,193 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=952878.0, ans=0.0 2023-06-23 21:25:23,392 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=952878.0, ans=0.1 2023-06-23 21:25:34,561 INFO [train.py:996] (0/4) Epoch 6, batch 6350, loss[loss=0.2396, simple_loss=0.3141, pruned_loss=0.0825, over 21807.00 frames. ], tot_loss[loss=0.225, simple_loss=0.3032, pruned_loss=0.07346, over 4276822.91 frames. ], batch size: 282, lr: 5.24e-03, grad_scale: 16.0 2023-06-23 21:25:51,528 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=952938.0, ans=0.125 2023-06-23 21:26:03,775 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=952998.0, ans=0.2 2023-06-23 21:26:16,228 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=952998.0, ans=0.125 2023-06-23 21:26:18,243 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=952998.0, ans=15.0 2023-06-23 21:26:21,308 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=953058.0, ans=0.125 2023-06-23 21:27:29,901 INFO [train.py:996] (0/4) Epoch 6, batch 6400, loss[loss=0.2865, simple_loss=0.3502, pruned_loss=0.1114, over 21821.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3102, pruned_loss=0.07836, over 4276213.58 frames. 
], batch size: 441, lr: 5.24e-03, grad_scale: 32.0 2023-06-23 21:28:03,429 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=953298.0, ans=0.1 2023-06-23 21:28:34,403 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=953358.0, ans=0.0 2023-06-23 21:28:42,595 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.145e+02 2.766e+02 2.997e+02 3.346e+02 4.721e+02, threshold=5.994e+02, percent-clipped=0.0 2023-06-23 21:28:43,424 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=953418.0, ans=0.1 2023-06-23 21:29:19,817 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=953478.0, ans=0.1 2023-06-23 21:29:24,305 INFO [train.py:996] (0/4) Epoch 6, batch 6450, loss[loss=0.1867, simple_loss=0.2617, pruned_loss=0.05585, over 21815.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3123, pruned_loss=0.07703, over 4277995.63 frames. ], batch size: 124, lr: 5.24e-03, grad_scale: 32.0 2023-06-23 21:31:13,657 INFO [train.py:996] (0/4) Epoch 6, batch 6500, loss[loss=0.1779, simple_loss=0.2542, pruned_loss=0.0508, over 21533.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3063, pruned_loss=0.07528, over 4272708.12 frames. ], batch size: 230, lr: 5.24e-03, grad_scale: 32.0 2023-06-23 21:31:16,238 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=953838.0, ans=0.1 2023-06-23 21:32:02,118 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=953958.0, ans=0.1 2023-06-23 21:32:14,266 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 21:32:18,654 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.973e+02 2.470e+02 2.695e+02 2.978e+02 5.209e+02, threshold=5.391e+02, percent-clipped=0.0 2023-06-23 21:32:37,546 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=954078.0, ans=0.05 2023-06-23 21:33:00,400 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=954138.0, ans=0.125 2023-06-23 21:33:01,254 INFO [train.py:996] (0/4) Epoch 6, batch 6550, loss[loss=0.1991, simple_loss=0.2813, pruned_loss=0.05844, over 21589.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.3049, pruned_loss=0.07417, over 4261659.27 frames. ], batch size: 230, lr: 5.24e-03, grad_scale: 32.0 2023-06-23 21:33:04,470 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.54 vs. limit=15.0 2023-06-23 21:33:07,608 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.63 vs. limit=15.0 2023-06-23 21:33:10,965 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.19 vs. 
limit=15.0 2023-06-23 21:33:45,825 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=954258.0, ans=0.0 2023-06-23 21:34:44,930 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=954378.0, ans=0.125 2023-06-23 21:34:47,732 INFO [train.py:996] (0/4) Epoch 6, batch 6600, loss[loss=0.2107, simple_loss=0.2693, pruned_loss=0.07608, over 21799.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.2996, pruned_loss=0.07385, over 4273367.52 frames. ], batch size: 98, lr: 5.23e-03, grad_scale: 8.0 2023-06-23 21:34:50,129 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=954438.0, ans=0.0 2023-06-23 21:34:52,304 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.46 vs. limit=15.0 2023-06-23 21:34:56,981 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=954438.0, ans=0.07 2023-06-23 21:36:01,787 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.786e+02 2.286e+02 2.575e+02 2.928e+02 5.219e+02, threshold=5.150e+02, percent-clipped=0.0 2023-06-23 21:36:04,108 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=954618.0, ans=0.125 2023-06-23 21:36:25,749 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=954678.0, ans=0.125 2023-06-23 21:36:35,425 INFO [train.py:996] (0/4) Epoch 6, batch 6650, loss[loss=0.2155, simple_loss=0.2574, pruned_loss=0.08677, over 20111.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2923, pruned_loss=0.0713, over 4271586.69 frames. ], batch size: 703, lr: 5.23e-03, grad_scale: 8.0 2023-06-23 21:37:20,277 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.33 vs. limit=15.0 2023-06-23 21:38:14,562 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=954978.0, ans=0.2 2023-06-23 21:38:18,829 INFO [train.py:996] (0/4) Epoch 6, batch 6700, loss[loss=0.1801, simple_loss=0.2589, pruned_loss=0.05069, over 21817.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.286, pruned_loss=0.07113, over 4267554.25 frames. ], batch size: 118, lr: 5.23e-03, grad_scale: 8.0 2023-06-23 21:39:10,541 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.38 vs. limit=12.0 2023-06-23 21:39:18,853 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=955158.0, ans=0.0 2023-06-23 21:39:34,441 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.914e+02 2.289e+02 2.607e+02 3.016e+02 4.316e+02, threshold=5.215e+02, percent-clipped=0.0 2023-06-23 21:39:41,382 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.73 vs. limit=22.5 2023-06-23 21:40:07,892 INFO [train.py:996] (0/4) Epoch 6, batch 6750, loss[loss=0.2018, simple_loss=0.2667, pruned_loss=0.06844, over 21263.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2829, pruned_loss=0.07097, over 4263227.58 frames. 
], batch size: 176, lr: 5.23e-03, grad_scale: 8.0 2023-06-23 21:40:30,768 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=955398.0, ans=0.2 2023-06-23 21:41:11,707 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=955518.0, ans=0.125 2023-06-23 21:41:49,257 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=955578.0, ans=0.125 2023-06-23 21:41:54,993 INFO [train.py:996] (0/4) Epoch 6, batch 6800, loss[loss=0.2446, simple_loss=0.3033, pruned_loss=0.09295, over 21883.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2853, pruned_loss=0.07323, over 4275243.52 frames. ], batch size: 98, lr: 5.23e-03, grad_scale: 16.0 2023-06-23 21:43:03,409 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.965e+02 2.510e+02 2.967e+02 3.494e+02 5.351e+02, threshold=5.935e+02, percent-clipped=1.0 2023-06-23 21:43:31,266 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=955878.0, ans=0.07 2023-06-23 21:43:34,608 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=955878.0, ans=0.0 2023-06-23 21:43:42,673 INFO [train.py:996] (0/4) Epoch 6, batch 6850, loss[loss=0.2542, simple_loss=0.2906, pruned_loss=0.1089, over 21463.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.284, pruned_loss=0.07423, over 4270976.27 frames. ], batch size: 509, lr: 5.23e-03, grad_scale: 16.0 2023-06-23 21:44:15,797 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=955998.0, ans=0.125 2023-06-23 21:44:47,591 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=956118.0, ans=0.125 2023-06-23 21:45:32,167 INFO [train.py:996] (0/4) Epoch 6, batch 6900, loss[loss=0.2471, simple_loss=0.3269, pruned_loss=0.08368, over 21622.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2864, pruned_loss=0.07449, over 4280534.15 frames. ], batch size: 508, lr: 5.23e-03, grad_scale: 16.0 2023-06-23 21:46:03,786 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.62 vs. limit=10.0 2023-06-23 21:46:15,484 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=956298.0, ans=0.125 2023-06-23 21:46:49,841 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.733e+02 2.526e+02 2.937e+02 3.629e+02 5.523e+02, threshold=5.874e+02, percent-clipped=0.0 2023-06-23 21:47:06,494 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=956478.0, ans=0.0 2023-06-23 21:47:27,771 INFO [train.py:996] (0/4) Epoch 6, batch 6950, loss[loss=0.2538, simple_loss=0.3298, pruned_loss=0.08889, over 21720.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2884, pruned_loss=0.07146, over 4275034.80 frames. 
], batch size: 332, lr: 5.23e-03, grad_scale: 16.0 2023-06-23 21:47:50,329 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=956598.0, ans=0.015 2023-06-23 21:47:57,677 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=956598.0, ans=0.1 2023-06-23 21:48:00,879 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=956658.0, ans=0.125 2023-06-23 21:48:16,525 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=956658.0, ans=0.125 2023-06-23 21:48:39,364 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=956718.0, ans=0.125 2023-06-23 21:49:14,672 INFO [train.py:996] (0/4) Epoch 6, batch 7000, loss[loss=0.2244, simple_loss=0.2841, pruned_loss=0.08232, over 21447.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2911, pruned_loss=0.07345, over 4279768.30 frames. ], batch size: 389, lr: 5.23e-03, grad_scale: 16.0 2023-06-23 21:49:21,512 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.19 vs. limit=6.0 2023-06-23 21:50:22,437 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=957018.0, ans=0.2 2023-06-23 21:50:27,290 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.073e+02 2.602e+02 2.936e+02 3.362e+02 6.122e+02, threshold=5.872e+02, percent-clipped=1.0 2023-06-23 21:51:05,581 INFO [train.py:996] (0/4) Epoch 6, batch 7050, loss[loss=0.2065, simple_loss=0.2916, pruned_loss=0.06066, over 21730.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2879, pruned_loss=0.07195, over 4282568.73 frames. ], batch size: 351, lr: 5.23e-03, grad_scale: 16.0 2023-06-23 21:51:10,041 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=957138.0, ans=0.2 2023-06-23 21:51:12,295 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.58 vs. limit=15.0 2023-06-23 21:51:32,499 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=957198.0, ans=0.0 2023-06-23 21:51:55,156 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.24 vs. limit=12.0 2023-06-23 21:51:56,233 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=957258.0, ans=0.125 2023-06-23 21:51:58,067 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=957258.0, ans=0.125 2023-06-23 21:52:08,144 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=957318.0, ans=0.1 2023-06-23 21:52:36,135 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=957378.0, ans=0.125 2023-06-23 21:52:49,881 INFO [train.py:996] (0/4) Epoch 6, batch 7100, loss[loss=0.1933, simple_loss=0.2717, pruned_loss=0.05741, over 21301.00 frames. 
], tot_loss[loss=0.2202, simple_loss=0.2934, pruned_loss=0.07349, over 4285950.11 frames. ], batch size: 176, lr: 5.23e-03, grad_scale: 16.0 2023-06-23 21:53:14,248 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.01 vs. limit=15.0 2023-06-23 21:53:53,928 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.14 vs. limit=15.0 2023-06-23 21:53:55,067 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=957618.0, ans=0.125 2023-06-23 21:53:58,851 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=957618.0, ans=0.0 2023-06-23 21:54:06,833 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.538e+02 2.381e+02 2.673e+02 3.454e+02 5.437e+02, threshold=5.346e+02, percent-clipped=0.0 2023-06-23 21:54:07,795 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=957618.0, ans=0.0 2023-06-23 21:54:35,283 INFO [train.py:996] (0/4) Epoch 6, batch 7150, loss[loss=0.1945, simple_loss=0.275, pruned_loss=0.05694, over 21763.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2902, pruned_loss=0.07059, over 4274269.36 frames. ], batch size: 332, lr: 5.23e-03, grad_scale: 16.0 2023-06-23 21:54:50,228 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=957738.0, ans=0.1 2023-06-23 21:55:15,443 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=957798.0, ans=0.125 2023-06-23 21:56:23,767 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=958038.0, ans=0.125 2023-06-23 21:56:24,641 INFO [train.py:996] (0/4) Epoch 6, batch 7200, loss[loss=0.2065, simple_loss=0.3152, pruned_loss=0.04896, over 19707.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2934, pruned_loss=0.07265, over 4269199.81 frames. ], batch size: 703, lr: 5.23e-03, grad_scale: 32.0 2023-06-23 21:56:42,562 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=958098.0, ans=0.0 2023-06-23 21:56:59,565 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=958098.0, ans=0.125 2023-06-23 21:57:11,484 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=958158.0, ans=0.2 2023-06-23 21:57:46,111 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.971e+02 2.518e+02 2.883e+02 3.559e+02 6.632e+02, threshold=5.766e+02, percent-clipped=3.0 2023-06-23 21:57:56,609 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=958278.0, ans=0.0 2023-06-23 21:58:13,499 INFO [train.py:996] (0/4) Epoch 6, batch 7250, loss[loss=0.2228, simple_loss=0.3305, pruned_loss=0.05759, over 19813.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2905, pruned_loss=0.07276, over 4260735.25 frames. 
], batch size: 703, lr: 5.22e-03, grad_scale: 32.0 2023-06-23 21:58:18,415 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.28 vs. limit=10.0 2023-06-23 21:58:35,894 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.95 vs. limit=15.0 2023-06-23 21:58:55,250 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=958398.0, ans=0.0 2023-06-23 21:59:27,142 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.07 vs. limit=15.0 2023-06-23 22:00:01,988 INFO [train.py:996] (0/4) Epoch 6, batch 7300, loss[loss=0.1815, simple_loss=0.2797, pruned_loss=0.04167, over 20781.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2853, pruned_loss=0.07195, over 4267853.53 frames. ], batch size: 609, lr: 5.22e-03, grad_scale: 32.0 2023-06-23 22:00:06,287 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=958638.0, ans=0.04949747468305833 2023-06-23 22:00:42,089 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=958698.0, ans=0.1 2023-06-23 22:01:09,727 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.45 vs. limit=12.0 2023-06-23 22:01:12,379 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=958758.0, ans=0.1 2023-06-23 22:01:24,348 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.896e+02 2.461e+02 2.779e+02 3.106e+02 5.760e+02, threshold=5.558e+02, percent-clipped=0.0 2023-06-23 22:01:34,285 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=958878.0, ans=0.125 2023-06-23 22:01:51,242 INFO [train.py:996] (0/4) Epoch 6, batch 7350, loss[loss=0.2454, simple_loss=0.3097, pruned_loss=0.09053, over 21550.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.2844, pruned_loss=0.07342, over 4262474.11 frames. ], batch size: 389, lr: 5.22e-03, grad_scale: 16.0 2023-06-23 22:01:59,501 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=958938.0, ans=0.0 2023-06-23 22:03:37,928 INFO [train.py:996] (0/4) Epoch 6, batch 7400, loss[loss=0.2108, simple_loss=0.2947, pruned_loss=0.06346, over 21690.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.2908, pruned_loss=0.07602, over 4265775.09 frames. ], batch size: 247, lr: 5.22e-03, grad_scale: 16.0 2023-06-23 22:03:56,095 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.70 vs. 
limit=15.0 2023-06-23 22:04:09,545 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=959298.0, ans=0.0 2023-06-23 22:04:36,135 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=959358.0, ans=0.2 2023-06-23 22:04:39,215 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=959358.0, ans=0.125 2023-06-23 22:04:56,486 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=959418.0, ans=0.04949747468305833 2023-06-23 22:05:00,960 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.952e+02 2.692e+02 3.073e+02 3.719e+02 6.060e+02, threshold=6.147e+02, percent-clipped=2.0 2023-06-23 22:05:03,806 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=959418.0, ans=0.2 2023-06-23 22:05:39,079 INFO [train.py:996] (0/4) Epoch 6, batch 7450, loss[loss=0.2583, simple_loss=0.3039, pruned_loss=0.1064, over 21371.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2888, pruned_loss=0.07463, over 4269184.82 frames. ], batch size: 473, lr: 5.22e-03, grad_scale: 16.0 2023-06-23 22:06:06,137 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=959598.0, ans=0.125 2023-06-23 22:06:06,158 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=959598.0, ans=0.0 2023-06-23 22:06:29,098 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=959658.0, ans=0.125 2023-06-23 22:06:38,595 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.01 vs. limit=10.0 2023-06-23 22:06:43,152 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=959718.0, ans=0.125 2023-06-23 22:07:30,944 INFO [train.py:996] (0/4) Epoch 6, batch 7500, loss[loss=0.3004, simple_loss=0.3906, pruned_loss=0.1052, over 21655.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2953, pruned_loss=0.07673, over 4270739.78 frames. ], batch size: 441, lr: 5.22e-03, grad_scale: 16.0 2023-06-23 22:07:54,172 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=959838.0, ans=0.95 2023-06-23 22:08:20,489 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=959958.0, ans=0.1 2023-06-23 22:08:24,201 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=959958.0, ans=0.125 2023-06-23 22:08:27,041 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-160000.pt 2023-06-23 22:08:44,093 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.946e+02 2.824e+02 3.431e+02 4.118e+02 7.261e+02, threshold=6.863e+02, percent-clipped=3.0 2023-06-23 22:09:20,921 INFO [train.py:996] (0/4) Epoch 6, batch 7550, loss[loss=0.2085, simple_loss=0.3025, pruned_loss=0.05728, over 21639.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3029, pruned_loss=0.07494, over 4280527.67 frames. 
], batch size: 230, lr: 5.22e-03, grad_scale: 16.0 2023-06-23 22:10:49,750 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.48 vs. limit=15.0 2023-06-23 22:11:02,190 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=960378.0, ans=0.125 2023-06-23 22:11:02,750 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=960378.0, ans=15.0 2023-06-23 22:11:08,232 INFO [train.py:996] (0/4) Epoch 6, batch 7600, loss[loss=0.2329, simple_loss=0.2998, pruned_loss=0.08305, over 21359.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.3014, pruned_loss=0.07362, over 4286280.91 frames. ], batch size: 143, lr: 5.22e-03, grad_scale: 32.0 2023-06-23 22:11:46,766 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=960498.0, ans=0.0 2023-06-23 22:11:46,863 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=960498.0, ans=0.0 2023-06-23 22:12:18,813 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.780e+02 2.489e+02 2.806e+02 3.400e+02 5.423e+02, threshold=5.611e+02, percent-clipped=0.0 2023-06-23 22:12:19,794 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=960618.0, ans=0.2 2023-06-23 22:12:43,060 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=960678.0, ans=0.125 2023-06-23 22:12:56,163 INFO [train.py:996] (0/4) Epoch 6, batch 7650, loss[loss=0.2526, simple_loss=0.3207, pruned_loss=0.09228, over 21882.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2995, pruned_loss=0.07467, over 4283081.49 frames. ], batch size: 118, lr: 5.22e-03, grad_scale: 32.0 2023-06-23 22:13:07,326 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=960738.0, ans=0.1 2023-06-23 22:13:45,271 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=960858.0, ans=0.125 2023-06-23 22:13:50,581 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=960858.0, ans=0.0 2023-06-23 22:14:01,291 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=960918.0, ans=0.125 2023-06-23 22:14:25,220 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=960918.0, ans=0.2 2023-06-23 22:14:50,485 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=961038.0, ans=0.125 2023-06-23 22:14:51,537 INFO [train.py:996] (0/4) Epoch 6, batch 7700, loss[loss=0.2379, simple_loss=0.3001, pruned_loss=0.08786, over 21817.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3024, pruned_loss=0.07756, over 4289119.91 frames. ], batch size: 441, lr: 5.22e-03, grad_scale: 32.0 2023-06-23 22:15:03,350 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.85 vs. 
limit=22.5 2023-06-23 22:15:42,604 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=961158.0, ans=0.1 2023-06-23 22:16:06,421 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.909e+02 2.621e+02 3.080e+02 3.761e+02 5.045e+02, threshold=6.161e+02, percent-clipped=0.0 2023-06-23 22:16:43,621 INFO [train.py:996] (0/4) Epoch 6, batch 7750, loss[loss=0.2674, simple_loss=0.363, pruned_loss=0.0859, over 21748.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3097, pruned_loss=0.07833, over 4288290.91 frames. ], batch size: 351, lr: 5.22e-03, grad_scale: 32.0 2023-06-23 22:17:04,617 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.52 vs. limit=6.0 2023-06-23 22:17:44,925 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.95 vs. limit=15.0 2023-06-23 22:17:52,213 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.95 vs. limit=15.0 2023-06-23 22:17:53,349 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=961518.0, ans=0.025 2023-06-23 22:18:21,063 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 22:18:26,334 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=961578.0, ans=0.0 2023-06-23 22:18:34,451 INFO [train.py:996] (0/4) Epoch 6, batch 7800, loss[loss=0.2058, simple_loss=0.2688, pruned_loss=0.07136, over 21570.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3118, pruned_loss=0.07909, over 4289802.89 frames. ], batch size: 230, lr: 5.22e-03, grad_scale: 32.0 2023-06-23 22:18:45,689 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.54 vs. limit=15.0 2023-06-23 22:19:01,633 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.70 vs. limit=15.0 2023-06-23 22:19:22,107 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=961758.0, ans=0.125 2023-06-23 22:19:45,758 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.173e+02 2.845e+02 3.471e+02 4.135e+02 7.731e+02, threshold=6.941e+02, percent-clipped=4.0 2023-06-23 22:20:10,765 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.52 vs. limit=15.0 2023-06-23 22:20:21,475 INFO [train.py:996] (0/4) Epoch 6, batch 7850, loss[loss=0.213, simple_loss=0.2603, pruned_loss=0.08288, over 20317.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3039, pruned_loss=0.07818, over 4269140.16 frames. 
], batch size: 703, lr: 5.21e-03, grad_scale: 32.0 2023-06-23 22:20:24,018 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=961938.0, ans=0.125 2023-06-23 22:20:27,382 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=961938.0, ans=0.0 2023-06-23 22:22:15,675 INFO [train.py:996] (0/4) Epoch 6, batch 7900, loss[loss=0.1474, simple_loss=0.1804, pruned_loss=0.05715, over 16214.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.2992, pruned_loss=0.07814, over 4255170.75 frames. ], batch size: 61, lr: 5.21e-03, grad_scale: 32.0 2023-06-23 22:22:29,195 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=962238.0, ans=0.125 2023-06-23 22:22:31,413 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=962238.0, ans=0.125 2023-06-23 22:23:36,788 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.156e+02 2.814e+02 3.173e+02 3.712e+02 6.452e+02, threshold=6.346e+02, percent-clipped=0.0 2023-06-23 22:23:58,987 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=962478.0, ans=0.125 2023-06-23 22:24:03,344 INFO [train.py:996] (0/4) Epoch 6, batch 7950, loss[loss=0.195, simple_loss=0.2646, pruned_loss=0.0627, over 20773.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3036, pruned_loss=0.07717, over 4251787.08 frames. ], batch size: 609, lr: 5.21e-03, grad_scale: 32.0 2023-06-23 22:24:46,954 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=962658.0, ans=0.0 2023-06-23 22:24:57,638 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=962658.0, ans=0.04949747468305833 2023-06-23 22:25:02,170 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.81 vs. limit=15.0 2023-06-23 22:25:27,240 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=962718.0, ans=0.05 2023-06-23 22:25:58,611 INFO [train.py:996] (0/4) Epoch 6, batch 8000, loss[loss=0.2416, simple_loss=0.3588, pruned_loss=0.06223, over 20769.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3092, pruned_loss=0.07866, over 4253849.04 frames. ], batch size: 607, lr: 5.21e-03, grad_scale: 32.0 2023-06-23 22:27:05,362 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=962958.0, ans=0.0 2023-06-23 22:27:20,916 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.094e+02 2.660e+02 3.200e+02 3.986e+02 6.358e+02, threshold=6.400e+02, percent-clipped=1.0 2023-06-23 22:27:22,180 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.15 vs. limit=15.0 2023-06-23 22:27:47,803 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=963078.0, ans=0.125 2023-06-23 22:27:57,708 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.63 vs. 
limit=6.0 2023-06-23 22:27:59,893 INFO [train.py:996] (0/4) Epoch 6, batch 8050, loss[loss=0.2909, simple_loss=0.3756, pruned_loss=0.1032, over 21613.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3144, pruned_loss=0.08, over 4251239.62 frames. ], batch size: 441, lr: 5.21e-03, grad_scale: 32.0 2023-06-23 22:28:04,337 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=963138.0, ans=0.125 2023-06-23 22:29:41,772 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=963378.0, ans=0.125 2023-06-23 22:29:51,720 INFO [train.py:996] (0/4) Epoch 6, batch 8100, loss[loss=0.229, simple_loss=0.2972, pruned_loss=0.08036, over 21015.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3115, pruned_loss=0.08011, over 4260667.23 frames. ], batch size: 608, lr: 5.21e-03, grad_scale: 32.0 2023-06-23 22:31:23,064 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.199e+02 2.897e+02 3.319e+02 4.086e+02 8.225e+02, threshold=6.637e+02, percent-clipped=1.0 2023-06-23 22:31:58,992 INFO [train.py:996] (0/4) Epoch 6, batch 8150, loss[loss=0.2391, simple_loss=0.348, pruned_loss=0.06504, over 21779.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3187, pruned_loss=0.08099, over 4261007.69 frames. ], batch size: 352, lr: 5.21e-03, grad_scale: 16.0 2023-06-23 22:32:08,534 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=963738.0, ans=0.125 2023-06-23 22:32:37,842 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.04 vs. limit=12.0 2023-06-23 22:32:46,584 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.53 vs. limit=15.0 2023-06-23 22:33:03,503 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=963918.0, ans=0.1 2023-06-23 22:33:06,732 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=963918.0, ans=0.0 2023-06-23 22:33:12,447 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.33 vs. limit=12.0 2023-06-23 22:33:33,150 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=963978.0, ans=0.125 2023-06-23 22:33:48,079 INFO [train.py:996] (0/4) Epoch 6, batch 8200, loss[loss=0.2635, simple_loss=0.3052, pruned_loss=0.1109, over 21415.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3102, pruned_loss=0.07907, over 4256106.80 frames. 
], batch size: 509, lr: 5.21e-03, grad_scale: 16.0 2023-06-23 22:34:42,260 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=964158.0, ans=0.125 2023-06-23 22:34:47,163 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=964158.0, ans=0.125 2023-06-23 22:35:09,645 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.036e+02 2.474e+02 2.845e+02 3.510e+02 6.334e+02, threshold=5.689e+02, percent-clipped=0.0 2023-06-23 22:35:39,841 INFO [train.py:996] (0/4) Epoch 6, batch 8250, loss[loss=0.213, simple_loss=0.2967, pruned_loss=0.06469, over 21431.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3068, pruned_loss=0.07885, over 4251519.06 frames. ], batch size: 131, lr: 5.21e-03, grad_scale: 16.0 2023-06-23 22:36:27,778 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=964458.0, ans=10.0 2023-06-23 22:36:34,884 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=964458.0, ans=0.1 2023-06-23 22:36:58,777 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.90 vs. limit=15.0 2023-06-23 22:37:30,676 INFO [train.py:996] (0/4) Epoch 6, batch 8300, loss[loss=0.2161, simple_loss=0.2909, pruned_loss=0.07065, over 21245.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3047, pruned_loss=0.0754, over 4251054.89 frames. ], batch size: 176, lr: 5.21e-03, grad_scale: 16.0 2023-06-23 22:37:39,794 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=964638.0, ans=0.025 2023-06-23 22:38:49,166 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.813e+02 2.358e+02 2.866e+02 3.291e+02 6.256e+02, threshold=5.732e+02, percent-clipped=2.0 2023-06-23 22:39:19,239 INFO [train.py:996] (0/4) Epoch 6, batch 8350, loss[loss=0.2164, simple_loss=0.2894, pruned_loss=0.07167, over 21784.00 frames. ], tot_loss[loss=0.225, simple_loss=0.3039, pruned_loss=0.07307, over 4250627.78 frames. ], batch size: 372, lr: 5.21e-03, grad_scale: 16.0 2023-06-23 22:40:50,572 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=965178.0, ans=0.0 2023-06-23 22:41:01,131 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.48 vs. limit=22.5 2023-06-23 22:41:01,167 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.69 vs. limit=12.0 2023-06-23 22:41:08,745 INFO [train.py:996] (0/4) Epoch 6, batch 8400, loss[loss=0.2216, simple_loss=0.2647, pruned_loss=0.08929, over 20004.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.3005, pruned_loss=0.07094, over 4241487.16 frames. 
], batch size: 703, lr: 5.21e-03, grad_scale: 32.0 2023-06-23 22:41:16,355 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=965238.0, ans=0.0 2023-06-23 22:42:24,697 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=965418.0, ans=0.05 2023-06-23 22:42:27,774 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.585e+02 2.294e+02 2.573e+02 3.024e+02 4.553e+02, threshold=5.145e+02, percent-clipped=0.0 2023-06-23 22:42:55,768 INFO [train.py:996] (0/4) Epoch 6, batch 8450, loss[loss=0.2419, simple_loss=0.3081, pruned_loss=0.0878, over 21235.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2991, pruned_loss=0.07135, over 4254255.74 frames. ], batch size: 143, lr: 5.20e-03, grad_scale: 16.0 2023-06-23 22:44:10,114 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=965718.0, ans=0.5 2023-06-23 22:44:33,364 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=965778.0, ans=0.0 2023-06-23 22:44:44,908 INFO [train.py:996] (0/4) Epoch 6, batch 8500, loss[loss=0.2198, simple_loss=0.2748, pruned_loss=0.08241, over 21209.00 frames. ], tot_loss[loss=0.221, simple_loss=0.2954, pruned_loss=0.07328, over 4253785.39 frames. ], batch size: 159, lr: 5.20e-03, grad_scale: 16.0 2023-06-23 22:44:49,120 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=965838.0, ans=0.0 2023-06-23 22:44:52,706 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=965838.0, ans=0.125 2023-06-23 22:45:41,848 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=965958.0, ans=0.05 2023-06-23 22:46:11,484 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.94 vs. limit=15.0 2023-06-23 22:46:13,504 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.818e+02 2.833e+02 3.387e+02 4.039e+02 6.147e+02, threshold=6.774e+02, percent-clipped=7.0 2023-06-23 22:46:29,394 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.99 vs. limit=12.0 2023-06-23 22:46:36,911 INFO [train.py:996] (0/4) Epoch 6, batch 8550, loss[loss=0.2273, simple_loss=0.3154, pruned_loss=0.06961, over 21616.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.2995, pruned_loss=0.07548, over 4256289.31 frames. ], batch size: 263, lr: 5.20e-03, grad_scale: 16.0 2023-06-23 22:47:19,565 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=966198.0, ans=0.125 2023-06-23 22:47:41,321 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.34 vs. 
limit=12.0 2023-06-23 22:47:49,353 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=966318.0, ans=0.125 2023-06-23 22:48:22,082 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 22:48:34,782 INFO [train.py:996] (0/4) Epoch 6, batch 8600, loss[loss=0.1891, simple_loss=0.2297, pruned_loss=0.07424, over 20018.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3082, pruned_loss=0.07817, over 4263817.97 frames. ], batch size: 704, lr: 5.20e-03, grad_scale: 16.0 2023-06-23 22:48:48,996 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=966438.0, ans=0.0 2023-06-23 22:48:49,628 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.26 vs. limit=15.0 2023-06-23 22:49:28,413 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=966558.0, ans=0.0 2023-06-23 22:49:29,896 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 22:49:57,424 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.156e+02 2.875e+02 3.260e+02 4.247e+02 6.190e+02, threshold=6.520e+02, percent-clipped=0.0 2023-06-23 22:50:31,076 INFO [train.py:996] (0/4) Epoch 6, batch 8650, loss[loss=0.2233, simple_loss=0.2993, pruned_loss=0.07362, over 20812.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3113, pruned_loss=0.07927, over 4258792.01 frames. ], batch size: 607, lr: 5.20e-03, grad_scale: 16.0 2023-06-23 22:50:38,467 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=966738.0, ans=0.0 2023-06-23 22:50:39,010 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.02 vs. limit=12.0 2023-06-23 22:51:12,676 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=966858.0, ans=0.2 2023-06-23 22:51:41,354 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.43 vs. limit=15.0 2023-06-23 22:52:05,461 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=966978.0, ans=0.0 2023-06-23 22:52:05,581 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=966978.0, ans=0.1 2023-06-23 22:52:07,498 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.01 vs. limit=15.0 2023-06-23 22:52:13,206 INFO [train.py:996] (0/4) Epoch 6, batch 8700, loss[loss=0.2139, simple_loss=0.282, pruned_loss=0.07293, over 21793.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3028, pruned_loss=0.07488, over 4253489.55 frames. 
], batch size: 98, lr: 5.20e-03, grad_scale: 16.0 2023-06-23 22:53:30,832 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 22:53:32,510 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=967218.0, ans=0.125 2023-06-23 22:53:33,443 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.609e+02 2.283e+02 2.590e+02 3.172e+02 4.476e+02, threshold=5.179e+02, percent-clipped=0.0 2023-06-23 22:53:45,498 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=967278.0, ans=0.0 2023-06-23 22:54:08,956 INFO [train.py:996] (0/4) Epoch 6, batch 8750, loss[loss=0.2166, simple_loss=0.2832, pruned_loss=0.07504, over 21472.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.2991, pruned_loss=0.07474, over 4256975.05 frames. ], batch size: 194, lr: 5.20e-03, grad_scale: 16.0 2023-06-23 22:54:22,555 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=967338.0, ans=0.125 2023-06-23 22:54:36,784 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.13 vs. limit=10.0 2023-06-23 22:54:45,203 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=967398.0, ans=0.125 2023-06-23 22:54:48,839 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=967458.0, ans=0.125 2023-06-23 22:56:02,222 INFO [train.py:996] (0/4) Epoch 6, batch 8800, loss[loss=0.2866, simple_loss=0.3682, pruned_loss=0.1025, over 21847.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3085, pruned_loss=0.07794, over 4260023.92 frames. ], batch size: 118, lr: 5.20e-03, grad_scale: 32.0 2023-06-23 22:57:28,540 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.027e+02 2.723e+02 3.088e+02 3.591e+02 5.183e+02, threshold=6.177e+02, percent-clipped=1.0 2023-06-23 22:57:56,328 INFO [train.py:996] (0/4) Epoch 6, batch 8850, loss[loss=0.2788, simple_loss=0.3575, pruned_loss=0.1, over 21388.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3169, pruned_loss=0.08008, over 4256028.23 frames. ], batch size: 131, lr: 5.20e-03, grad_scale: 32.0 2023-06-23 22:58:16,874 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.05 vs. limit=12.0 2023-06-23 22:58:27,560 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.45 vs. limit=15.0 2023-06-23 22:59:46,052 INFO [train.py:996] (0/4) Epoch 6, batch 8900, loss[loss=0.2294, simple_loss=0.2983, pruned_loss=0.08024, over 21795.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3108, pruned_loss=0.07846, over 4250766.06 frames. ], batch size: 102, lr: 5.20e-03, grad_scale: 32.0 2023-06-23 23:00:26,380 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.25 vs. 
limit=15.0 2023-06-23 23:00:35,264 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=968358.0, ans=0.0 2023-06-23 23:00:56,241 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.40 vs. limit=6.0 2023-06-23 23:01:18,318 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.874e+02 2.656e+02 3.141e+02 3.730e+02 7.900e+02, threshold=6.282e+02, percent-clipped=6.0 2023-06-23 23:01:39,337 INFO [train.py:996] (0/4) Epoch 6, batch 8950, loss[loss=0.2007, simple_loss=0.2608, pruned_loss=0.07028, over 21212.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3103, pruned_loss=0.07818, over 4255764.80 frames. ], batch size: 176, lr: 5.20e-03, grad_scale: 16.0 2023-06-23 23:01:46,942 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=968538.0, ans=0.125 2023-06-23 23:03:29,125 INFO [train.py:996] (0/4) Epoch 6, batch 9000, loss[loss=0.2038, simple_loss=0.263, pruned_loss=0.07233, over 21815.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3035, pruned_loss=0.07777, over 4261896.27 frames. ], batch size: 124, lr: 5.20e-03, grad_scale: 16.0 2023-06-23 23:03:29,127 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-23 23:03:48,709 INFO [train.py:1028] (0/4) Epoch 6, validation: loss=0.2652, simple_loss=0.3551, pruned_loss=0.08764, over 1796401.00 frames. 2023-06-23 23:03:48,711 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23616MB 2023-06-23 23:03:51,449 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=968838.0, ans=0.2 2023-06-23 23:03:54,504 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=968838.0, ans=0.125 2023-06-23 23:03:58,164 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=968838.0, ans=0.0 2023-06-23 23:04:36,361 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=968958.0, ans=0.125 2023-06-23 23:05:12,838 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.595e+02 2.551e+02 3.018e+02 3.495e+02 6.048e+02, threshold=6.037e+02, percent-clipped=0.0 2023-06-23 23:05:30,147 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=969078.0, ans=0.125 2023-06-23 23:05:45,389 INFO [train.py:996] (0/4) Epoch 6, batch 9050, loss[loss=0.2624, simple_loss=0.3392, pruned_loss=0.09285, over 21754.00 frames. ], tot_loss[loss=0.225, simple_loss=0.2999, pruned_loss=0.07502, over 4261206.90 frames. ], batch size: 124, lr: 5.20e-03, grad_scale: 16.0 2023-06-23 23:05:55,384 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=969138.0, ans=0.1 2023-06-23 23:07:34,817 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=969378.0, ans=0.125 2023-06-23 23:07:39,250 INFO [train.py:996] (0/4) Epoch 6, batch 9100, loss[loss=0.2037, simple_loss=0.2967, pruned_loss=0.05536, over 21309.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3071, pruned_loss=0.07785, over 4257433.79 frames. 
], batch size: 176, lr: 5.19e-03, grad_scale: 16.0 2023-06-23 23:08:09,382 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=969498.0, ans=0.125 2023-06-23 23:08:30,517 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=969558.0, ans=0.0 2023-06-23 23:09:04,185 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.712e+02 2.470e+02 2.760e+02 3.335e+02 5.659e+02, threshold=5.519e+02, percent-clipped=0.0 2023-06-23 23:09:30,884 INFO [train.py:996] (0/4) Epoch 6, batch 9150, loss[loss=0.2255, simple_loss=0.3085, pruned_loss=0.07126, over 21450.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3081, pruned_loss=0.0744, over 4269365.16 frames. ], batch size: 211, lr: 5.19e-03, grad_scale: 16.0 2023-06-23 23:09:31,568 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=969738.0, ans=0.2 2023-06-23 23:09:54,340 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.56 vs. limit=10.0 2023-06-23 23:10:19,928 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=969858.0, ans=0.2 2023-06-23 23:11:12,009 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=969978.0, ans=0.125 2023-06-23 23:11:22,078 INFO [train.py:996] (0/4) Epoch 6, batch 9200, loss[loss=0.2689, simple_loss=0.3404, pruned_loss=0.09867, over 21814.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3107, pruned_loss=0.07404, over 4277999.04 frames. ], batch size: 124, lr: 5.19e-03, grad_scale: 32.0 2023-06-23 23:11:32,740 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.68 vs. limit=15.0 2023-06-23 23:12:05,553 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=970098.0, ans=0.0 2023-06-23 23:12:20,036 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=970158.0, ans=0.2 2023-06-23 23:12:51,024 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.987e+02 2.565e+02 2.927e+02 3.982e+02 7.343e+02, threshold=5.853e+02, percent-clipped=8.0 2023-06-23 23:13:17,989 INFO [train.py:996] (0/4) Epoch 6, batch 9250, loss[loss=0.2325, simple_loss=0.2969, pruned_loss=0.08402, over 21451.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.314, pruned_loss=0.07744, over 4280502.27 frames. ], batch size: 389, lr: 5.19e-03, grad_scale: 32.0 2023-06-23 23:13:27,983 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.68 vs. 
limit=15.0 2023-06-23 23:13:57,268 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=970458.0, ans=0.125 2023-06-23 23:14:02,390 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=970458.0, ans=0.1 2023-06-23 23:14:41,576 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=970518.0, ans=0.125 2023-06-23 23:15:01,668 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=970578.0, ans=0.1 2023-06-23 23:15:15,156 INFO [train.py:996] (0/4) Epoch 6, batch 9300, loss[loss=0.1973, simple_loss=0.2685, pruned_loss=0.06302, over 21767.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3077, pruned_loss=0.07669, over 4275262.23 frames. ], batch size: 124, lr: 5.19e-03, grad_scale: 32.0 2023-06-23 23:15:23,225 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=970638.0, ans=0.0 2023-06-23 23:15:25,033 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=970638.0, ans=0.125 2023-06-23 23:15:41,812 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=970698.0, ans=15.0 2023-06-23 23:15:43,742 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.14 vs. limit=6.0 2023-06-23 23:15:51,763 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=970698.0, ans=0.2 2023-06-23 23:16:11,662 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.75 vs. limit=15.0 2023-06-23 23:16:33,207 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.072e+02 2.705e+02 3.300e+02 3.579e+02 5.908e+02, threshold=6.601e+02, percent-clipped=1.0 2023-06-23 23:17:06,345 INFO [train.py:996] (0/4) Epoch 6, batch 9350, loss[loss=0.2602, simple_loss=0.3431, pruned_loss=0.08865, over 21805.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3132, pruned_loss=0.07773, over 4276291.57 frames. ], batch size: 118, lr: 5.19e-03, grad_scale: 32.0 2023-06-23 23:17:37,477 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=970998.0, ans=0.2 2023-06-23 23:17:40,089 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.05 vs. limit=5.0 2023-06-23 23:18:25,915 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.92 vs. limit=15.0 2023-06-23 23:18:57,442 INFO [train.py:996] (0/4) Epoch 6, batch 9400, loss[loss=0.2339, simple_loss=0.2946, pruned_loss=0.08657, over 21532.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3157, pruned_loss=0.07809, over 4272052.40 frames. 
], batch size: 441, lr: 5.19e-03, grad_scale: 32.0 2023-06-23 23:19:17,148 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=971238.0, ans=0.04949747468305833 2023-06-23 23:19:46,165 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=971358.0, ans=0.125 2023-06-23 23:20:25,048 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.059e+02 2.477e+02 2.813e+02 3.524e+02 8.030e+02, threshold=5.626e+02, percent-clipped=3.0 2023-06-23 23:20:46,126 INFO [train.py:996] (0/4) Epoch 6, batch 9450, loss[loss=0.2031, simple_loss=0.2652, pruned_loss=0.07056, over 21550.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3075, pruned_loss=0.07681, over 4267258.06 frames. ], batch size: 195, lr: 5.19e-03, grad_scale: 16.0 2023-06-23 23:22:00,444 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=971718.0, ans=0.125 2023-06-23 23:22:00,515 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=971718.0, ans=0.1 2023-06-23 23:22:14,242 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=971778.0, ans=0.04949747468305833 2023-06-23 23:22:24,866 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=971778.0, ans=0.0 2023-06-23 23:22:29,378 INFO [train.py:996] (0/4) Epoch 6, batch 9500, loss[loss=0.1869, simple_loss=0.2742, pruned_loss=0.04983, over 21707.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.2996, pruned_loss=0.07477, over 4264967.69 frames. ], batch size: 332, lr: 5.19e-03, grad_scale: 16.0 2023-06-23 23:22:44,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=971838.0, ans=0.0 2023-06-23 23:23:30,435 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.94 vs. limit=6.0 2023-06-23 23:23:46,469 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.65 vs. limit=15.0 2023-06-23 23:23:47,365 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=972018.0, ans=0.1 2023-06-23 23:23:55,323 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.823e+02 2.481e+02 2.768e+02 3.385e+02 5.932e+02, threshold=5.537e+02, percent-clipped=1.0 2023-06-23 23:24:20,138 INFO [train.py:996] (0/4) Epoch 6, batch 9550, loss[loss=0.2254, simple_loss=0.3121, pruned_loss=0.06937, over 16367.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3045, pruned_loss=0.0772, over 4261171.29 frames. 
], batch size: 60, lr: 5.19e-03, grad_scale: 16.0 2023-06-23 23:24:37,947 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 23:25:22,450 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=972318.0, ans=0.125 2023-06-23 23:25:31,357 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=972318.0, ans=0.2 2023-06-23 23:25:56,667 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.68 vs. limit=15.0 2023-06-23 23:26:02,072 INFO [train.py:996] (0/4) Epoch 6, batch 9600, loss[loss=0.2188, simple_loss=0.296, pruned_loss=0.07083, over 21855.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3076, pruned_loss=0.07793, over 4268799.82 frames. ], batch size: 351, lr: 5.19e-03, grad_scale: 32.0 2023-06-23 23:26:23,394 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=972438.0, ans=0.2 2023-06-23 23:26:25,203 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=972438.0, ans=0.125 2023-06-23 23:26:35,337 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=972498.0, ans=0.1 2023-06-23 23:26:38,913 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=972498.0, ans=0.1 2023-06-23 23:27:32,661 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.956e+02 2.542e+02 2.834e+02 3.285e+02 4.885e+02, threshold=5.668e+02, percent-clipped=0.0 2023-06-23 23:27:55,434 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=972738.0, ans=0.025 2023-06-23 23:28:01,848 INFO [train.py:996] (0/4) Epoch 6, batch 9650, loss[loss=0.2503, simple_loss=0.3246, pruned_loss=0.088, over 21631.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3085, pruned_loss=0.0783, over 4267976.86 frames. ], batch size: 389, lr: 5.19e-03, grad_scale: 16.0 2023-06-23 23:28:02,270 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 23:28:32,188 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=972798.0, ans=0.04949747468305833 2023-06-23 23:29:50,930 INFO [train.py:996] (0/4) Epoch 6, batch 9700, loss[loss=0.2391, simple_loss=0.302, pruned_loss=0.08809, over 21831.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3096, pruned_loss=0.07817, over 4253848.00 frames. ], batch size: 112, lr: 5.18e-03, grad_scale: 16.0 2023-06-23 23:30:12,336 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.27 vs. 
limit=15.0 2023-06-23 23:30:51,007 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=973158.0, ans=6.0 2023-06-23 23:31:10,722 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.037e+02 2.422e+02 2.744e+02 3.326e+02 5.586e+02, threshold=5.488e+02, percent-clipped=0.0 2023-06-23 23:31:12,994 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=973278.0, ans=0.0 2023-06-23 23:31:38,377 INFO [train.py:996] (0/4) Epoch 6, batch 9750, loss[loss=0.1992, simple_loss=0.2657, pruned_loss=0.06629, over 21635.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3024, pruned_loss=0.07683, over 4256498.65 frames. ], batch size: 298, lr: 5.18e-03, grad_scale: 16.0 2023-06-23 23:31:51,310 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=973338.0, ans=0.125 2023-06-23 23:31:54,726 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=973398.0, ans=0.2 2023-06-23 23:31:59,996 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=973398.0, ans=0.125 2023-06-23 23:32:05,849 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=7.12 vs. limit=8.0 2023-06-23 23:32:58,376 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=973578.0, ans=0.125 2023-06-23 23:33:03,929 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.65 vs. limit=15.0 2023-06-23 23:33:19,468 INFO [train.py:996] (0/4) Epoch 6, batch 9800, loss[loss=0.2539, simple_loss=0.3137, pruned_loss=0.09706, over 21594.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3031, pruned_loss=0.07696, over 4253648.28 frames. ], batch size: 471, lr: 5.18e-03, grad_scale: 16.0 2023-06-23 23:34:14,481 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=973758.0, ans=0.125 2023-06-23 23:34:35,682 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=973818.0, ans=0.2 2023-06-23 23:34:39,324 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=973818.0, ans=0.125 2023-06-23 23:34:45,287 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.997e+02 2.591e+02 2.983e+02 3.638e+02 9.651e+02, threshold=5.966e+02, percent-clipped=4.0 2023-06-23 23:34:58,106 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=973878.0, ans=0.0 2023-06-23 23:35:07,708 INFO [train.py:996] (0/4) Epoch 6, batch 9850, loss[loss=0.1871, simple_loss=0.2482, pruned_loss=0.06305, over 21224.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.2994, pruned_loss=0.07696, over 4262022.41 frames. 
], batch size: 176, lr: 5.18e-03, grad_scale: 16.0 2023-06-23 23:35:13,436 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=973938.0, ans=0.0 2023-06-23 23:35:39,723 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=973998.0, ans=0.125 2023-06-23 23:35:47,611 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.11 vs. limit=15.0 2023-06-23 23:36:05,378 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=974058.0, ans=0.125 2023-06-23 23:36:25,427 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=974118.0, ans=0.0 2023-06-23 23:36:43,162 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=974178.0, ans=0.0 2023-06-23 23:36:47,070 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=974178.0, ans=0.125 2023-06-23 23:36:57,008 INFO [train.py:996] (0/4) Epoch 6, batch 9900, loss[loss=0.2619, simple_loss=0.324, pruned_loss=0.09996, over 21576.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.2958, pruned_loss=0.07659, over 4259411.64 frames. ], batch size: 414, lr: 5.18e-03, grad_scale: 16.0 2023-06-23 23:37:04,506 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=974238.0, ans=0.125 2023-06-23 23:37:22,657 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=974298.0, ans=0.0 2023-06-23 23:37:23,227 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.86 vs. limit=15.0 2023-06-23 23:37:29,636 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=974298.0, ans=0.125 2023-06-23 23:38:18,546 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.48 vs. limit=15.0 2023-06-23 23:38:23,768 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.968e+02 2.567e+02 2.955e+02 3.451e+02 4.751e+02, threshold=5.911e+02, percent-clipped=0.0 2023-06-23 23:38:31,577 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=974478.0, ans=0.125 2023-06-23 23:38:46,974 INFO [train.py:996] (0/4) Epoch 6, batch 9950, loss[loss=0.2684, simple_loss=0.3065, pruned_loss=0.1151, over 21404.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.2984, pruned_loss=0.07913, over 4266490.41 frames. 
], batch size: 510, lr: 5.18e-03, grad_scale: 16.0 2023-06-23 23:39:11,160 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=974598.0, ans=0.07 2023-06-23 23:39:16,590 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=974598.0, ans=0.1 2023-06-23 23:39:28,920 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=974598.0, ans=0.0 2023-06-23 23:40:11,919 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=974718.0, ans=0.125 2023-06-23 23:40:31,911 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.36 vs. limit=22.5 2023-06-23 23:40:43,772 INFO [train.py:996] (0/4) Epoch 6, batch 10000, loss[loss=0.2407, simple_loss=0.2959, pruned_loss=0.09277, over 21454.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.2958, pruned_loss=0.07806, over 4253249.86 frames. ], batch size: 509, lr: 5.18e-03, grad_scale: 32.0 2023-06-23 23:40:49,899 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=974838.0, ans=0.125 2023-06-23 23:40:55,172 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=974838.0, ans=0.95 2023-06-23 23:42:09,999 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=975078.0, ans=0.0 2023-06-23 23:42:10,930 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.070e+02 2.477e+02 2.945e+02 3.555e+02 6.332e+02, threshold=5.891e+02, percent-clipped=1.0 2023-06-23 23:42:34,452 INFO [train.py:996] (0/4) Epoch 6, batch 10050, loss[loss=0.2361, simple_loss=0.3033, pruned_loss=0.08446, over 21420.00 frames. ], tot_loss[loss=0.229, simple_loss=0.2992, pruned_loss=0.07941, over 4258735.89 frames. ], batch size: 131, lr: 5.18e-03, grad_scale: 32.0 2023-06-23 23:42:42,366 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=975138.0, ans=0.025 2023-06-23 23:42:42,846 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.18 vs. limit=15.0 2023-06-23 23:43:20,590 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=975258.0, ans=0.125 2023-06-23 23:43:38,321 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=975318.0, ans=0.125 2023-06-23 23:43:45,983 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.49 vs. limit=5.0 2023-06-23 23:44:25,431 INFO [train.py:996] (0/4) Epoch 6, batch 10100, loss[loss=0.2665, simple_loss=0.3769, pruned_loss=0.07806, over 19853.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.2965, pruned_loss=0.0765, over 4265397.10 frames. 
], batch size: 702, lr: 5.18e-03, grad_scale: 32.0 2023-06-23 23:45:14,392 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=975558.0, ans=0.125 2023-06-23 23:45:39,578 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.67 vs. limit=22.5 2023-06-23 23:45:59,922 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.054e+02 2.533e+02 2.969e+02 3.783e+02 6.881e+02, threshold=5.937e+02, percent-clipped=1.0 2023-06-23 23:46:21,419 INFO [train.py:996] (0/4) Epoch 6, batch 10150, loss[loss=0.249, simple_loss=0.3271, pruned_loss=0.08545, over 21885.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3018, pruned_loss=0.07849, over 4265006.73 frames. ], batch size: 371, lr: 5.18e-03, grad_scale: 16.0 2023-06-23 23:46:22,073 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=975738.0, ans=0.1 2023-06-23 23:46:40,975 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.72 vs. limit=15.0 2023-06-23 23:47:48,025 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=975978.0, ans=0.125 2023-06-23 23:47:49,946 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=975978.0, ans=0.125 2023-06-23 23:48:09,640 INFO [train.py:996] (0/4) Epoch 6, batch 10200, loss[loss=0.1959, simple_loss=0.2774, pruned_loss=0.05718, over 21605.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.3015, pruned_loss=0.07636, over 4269545.90 frames. ], batch size: 263, lr: 5.18e-03, grad_scale: 16.0 2023-06-23 23:48:43,280 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.07 vs. limit=22.5 2023-06-23 23:48:44,214 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=976098.0, ans=0.1 2023-06-23 23:48:49,231 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=976158.0, ans=0.125 2023-06-23 23:48:54,467 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=976158.0, ans=0.0 2023-06-23 23:49:14,092 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=976218.0, ans=0.125 2023-06-23 23:49:38,143 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.688e+02 2.173e+02 2.583e+02 3.025e+02 4.269e+02, threshold=5.166e+02, percent-clipped=0.0 2023-06-23 23:49:59,505 INFO [train.py:996] (0/4) Epoch 6, batch 10250, loss[loss=0.1802, simple_loss=0.2778, pruned_loss=0.0413, over 21791.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2944, pruned_loss=0.07009, over 4269912.13 frames. ], batch size: 333, lr: 5.18e-03, grad_scale: 16.0 2023-06-23 23:51:39,863 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=976578.0, ans=0.125 2023-06-23 23:51:58,302 INFO [train.py:996] (0/4) Epoch 6, batch 10300, loss[loss=0.1675, simple_loss=0.2554, pruned_loss=0.03986, over 21864.00 frames. 
], tot_loss[loss=0.2191, simple_loss=0.2963, pruned_loss=0.07097, over 4275502.68 frames. ], batch size: 107, lr: 5.18e-03, grad_scale: 16.0 2023-06-23 23:52:14,095 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=976638.0, ans=0.125 2023-06-23 23:52:48,892 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=976758.0, ans=0.07 2023-06-23 23:53:28,847 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.542e+02 2.521e+02 2.843e+02 3.478e+02 5.751e+02, threshold=5.686e+02, percent-clipped=3.0 2023-06-23 23:53:52,275 INFO [train.py:996] (0/4) Epoch 6, batch 10350, loss[loss=0.1865, simple_loss=0.2569, pruned_loss=0.0581, over 21664.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2991, pruned_loss=0.07248, over 4276944.30 frames. ], batch size: 247, lr: 5.17e-03, grad_scale: 16.0 2023-06-23 23:53:54,976 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=976938.0, ans=0.2 2023-06-23 23:54:13,800 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.58 vs. limit=10.0 2023-06-23 23:54:14,735 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=976998.0, ans=0.125 2023-06-23 23:54:42,991 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.21 vs. limit=22.5 2023-06-23 23:55:32,823 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.46 vs. limit=15.0 2023-06-23 23:55:34,579 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.46 vs. limit=6.0 2023-06-23 23:55:43,945 INFO [train.py:996] (0/4) Epoch 6, batch 10400, loss[loss=0.2932, simple_loss=0.3498, pruned_loss=0.1183, over 21529.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2927, pruned_loss=0.07067, over 4261347.89 frames. ], batch size: 509, lr: 5.17e-03, grad_scale: 32.0 2023-06-23 23:56:38,626 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=977358.0, ans=0.125 2023-06-23 23:56:53,768 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.67 vs. limit=22.5 2023-06-23 23:57:12,712 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.26 vs. limit=15.0 2023-06-23 23:57:20,292 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.029e+02 2.786e+02 3.233e+02 3.708e+02 5.830e+02, threshold=6.465e+02, percent-clipped=3.0 2023-06-23 23:57:28,587 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=977478.0, ans=10.0 2023-06-23 23:57:29,713 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=977478.0, ans=0.125 2023-06-23 23:57:41,136 INFO [train.py:996] (0/4) Epoch 6, batch 10450, loss[loss=0.2378, simple_loss=0.3123, pruned_loss=0.08168, over 20660.00 frames. 
], tot_loss[loss=0.2229, simple_loss=0.2974, pruned_loss=0.07422, over 4271452.83 frames. ], batch size: 607, lr: 5.17e-03, grad_scale: 32.0 2023-06-23 23:58:24,880 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=977598.0, ans=0.125 2023-06-23 23:58:37,356 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=977658.0, ans=0.125 2023-06-23 23:59:04,035 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=977718.0, ans=0.0 2023-06-23 23:59:14,283 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=977778.0, ans=0.0 2023-06-23 23:59:30,735 INFO [train.py:996] (0/4) Epoch 6, batch 10500, loss[loss=0.2296, simple_loss=0.2988, pruned_loss=0.08018, over 15760.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.2967, pruned_loss=0.07309, over 4270063.16 frames. ], batch size: 60, lr: 5.17e-03, grad_scale: 16.0 2023-06-24 00:00:59,800 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.841e+02 2.398e+02 2.689e+02 3.123e+02 4.066e+02, threshold=5.379e+02, percent-clipped=0.0 2023-06-24 00:01:19,032 INFO [train.py:996] (0/4) Epoch 6, batch 10550, loss[loss=0.1846, simple_loss=0.2474, pruned_loss=0.06088, over 21652.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2905, pruned_loss=0.07263, over 4266983.03 frames. ], batch size: 264, lr: 5.17e-03, grad_scale: 16.0 2023-06-24 00:02:05,992 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=978258.0, ans=0.2 2023-06-24 00:03:09,234 INFO [train.py:996] (0/4) Epoch 6, batch 10600, loss[loss=0.1834, simple_loss=0.2611, pruned_loss=0.05282, over 21256.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2876, pruned_loss=0.07138, over 4270809.99 frames. ], batch size: 176, lr: 5.17e-03, grad_scale: 16.0 2023-06-24 00:04:47,268 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.753e+02 2.546e+02 2.981e+02 3.597e+02 7.487e+02, threshold=5.961e+02, percent-clipped=2.0 2023-06-24 00:04:59,203 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=978678.0, ans=0.125 2023-06-24 00:05:12,645 INFO [train.py:996] (0/4) Epoch 6, batch 10650, loss[loss=0.1602, simple_loss=0.2446, pruned_loss=0.03787, over 21612.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2889, pruned_loss=0.06948, over 4263993.47 frames. 
], batch size: 247, lr: 5.17e-03, grad_scale: 16.0 2023-06-24 00:05:33,771 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=978798.0, ans=0.1 2023-06-24 00:05:35,850 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=978798.0, ans=0.0 2023-06-24 00:05:40,602 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=978798.0, ans=0.0 2023-06-24 00:06:04,369 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=978858.0, ans=0.1 2023-06-24 00:06:34,159 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=978978.0, ans=0.1 2023-06-24 00:06:34,258 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=978978.0, ans=0.0 2023-06-24 00:07:03,057 INFO [train.py:996] (0/4) Epoch 6, batch 10700, loss[loss=0.2471, simple_loss=0.3192, pruned_loss=0.08751, over 21197.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2901, pruned_loss=0.07025, over 4252569.28 frames. ], batch size: 143, lr: 5.17e-03, grad_scale: 16.0 2023-06-24 00:07:38,005 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=979098.0, ans=0.125 2023-06-24 00:08:21,341 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=979218.0, ans=0.0 2023-06-24 00:08:29,169 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=979218.0, ans=0.0 2023-06-24 00:08:29,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=979218.0, ans=0.035 2023-06-24 00:08:30,990 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=979278.0, ans=0.125 2023-06-24 00:08:35,822 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.934e+02 2.562e+02 2.930e+02 3.343e+02 5.418e+02, threshold=5.860e+02, percent-clipped=0.0 2023-06-24 00:08:53,514 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.77 vs. limit=15.0 2023-06-24 00:08:55,547 INFO [train.py:996] (0/4) Epoch 6, batch 10750, loss[loss=0.222, simple_loss=0.3019, pruned_loss=0.07105, over 21373.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3005, pruned_loss=0.07414, over 4258969.86 frames. ], batch size: 176, lr: 5.17e-03, grad_scale: 16.0 2023-06-24 00:08:56,146 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=979338.0, ans=0.125 2023-06-24 00:09:28,428 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=979398.0, ans=0.125 2023-06-24 00:10:41,423 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 00:10:43,537 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.09 vs. 
limit=15.0 2023-06-24 00:10:47,843 INFO [train.py:996] (0/4) Epoch 6, batch 10800, loss[loss=0.2486, simple_loss=0.3201, pruned_loss=0.08855, over 21353.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.3047, pruned_loss=0.0749, over 4262237.87 frames. ], batch size: 176, lr: 5.17e-03, grad_scale: 32.0 2023-06-24 00:10:58,190 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=979638.0, ans=0.2 2023-06-24 00:11:02,652 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.59 vs. limit=15.0 2023-06-24 00:11:05,505 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=979638.0, ans=0.125 2023-06-24 00:12:11,590 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=979818.0, ans=0.125 2023-06-24 00:12:15,323 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=979818.0, ans=0.125 2023-06-24 00:12:24,836 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.018e+02 2.761e+02 3.249e+02 3.882e+02 5.958e+02, threshold=6.498e+02, percent-clipped=1.0 2023-06-24 00:12:44,082 INFO [train.py:996] (0/4) Epoch 6, batch 10850, loss[loss=0.1984, simple_loss=0.2765, pruned_loss=0.06015, over 21788.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3054, pruned_loss=0.07511, over 4260086.58 frames. ], batch size: 102, lr: 5.17e-03, grad_scale: 32.0 2023-06-24 00:12:51,655 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=979938.0, ans=0.2 2023-06-24 00:12:56,940 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=979938.0, ans=0.125 2023-06-24 00:14:35,117 INFO [train.py:996] (0/4) Epoch 6, batch 10900, loss[loss=0.2086, simple_loss=0.2945, pruned_loss=0.06142, over 21445.00 frames. ], tot_loss[loss=0.223, simple_loss=0.2993, pruned_loss=0.07335, over 4261494.41 frames. ], batch size: 194, lr: 5.17e-03, grad_scale: 16.0 2023-06-24 00:14:35,956 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=980238.0, ans=0.125 2023-06-24 00:15:14,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=980298.0, ans=0.1 2023-06-24 00:15:37,516 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.91 vs. limit=15.0 2023-06-24 00:15:49,662 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.76 vs. limit=10.0 2023-06-24 00:15:52,657 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=980418.0, ans=0.1 2023-06-24 00:16:05,803 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.122e+02 2.411e+02 2.776e+02 2.994e+02 5.292e+02, threshold=5.553e+02, percent-clipped=0.0 2023-06-24 00:16:22,914 INFO [train.py:996] (0/4) Epoch 6, batch 10950, loss[loss=0.2035, simple_loss=0.2711, pruned_loss=0.06794, over 21142.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2924, pruned_loss=0.07111, over 4258203.35 frames. 
], batch size: 143, lr: 5.16e-03, grad_scale: 16.0 2023-06-24 00:16:31,981 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=980538.0, ans=0.125 2023-06-24 00:16:44,910 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.84 vs. limit=22.5 2023-06-24 00:16:47,712 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=980598.0, ans=0.0 2023-06-24 00:16:50,979 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=980598.0, ans=0.1 2023-06-24 00:16:52,758 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 00:17:45,354 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.72 vs. limit=12.0 2023-06-24 00:18:13,153 INFO [train.py:996] (0/4) Epoch 6, batch 11000, loss[loss=0.2334, simple_loss=0.2998, pruned_loss=0.08351, over 21593.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2916, pruned_loss=0.07196, over 4258589.97 frames. ], batch size: 212, lr: 5.16e-03, grad_scale: 16.0 2023-06-24 00:19:02,338 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=980958.0, ans=0.125 2023-06-24 00:19:29,303 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=981018.0, ans=0.04949747468305833 2023-06-24 00:19:45,496 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.830e+02 2.423e+02 2.754e+02 3.301e+02 6.173e+02, threshold=5.508e+02, percent-clipped=2.0 2023-06-24 00:19:55,504 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=981078.0, ans=0.125 2023-06-24 00:19:58,305 INFO [train.py:996] (0/4) Epoch 6, batch 11050, loss[loss=0.2136, simple_loss=0.2791, pruned_loss=0.07409, over 21793.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2904, pruned_loss=0.07345, over 4271423.98 frames. ], batch size: 112, lr: 5.16e-03, grad_scale: 16.0 2023-06-24 00:20:25,930 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=981198.0, ans=0.0 2023-06-24 00:20:33,123 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=981198.0, ans=0.125 2023-06-24 00:20:42,508 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=981198.0, ans=0.125 2023-06-24 00:20:49,486 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=981258.0, ans=0.125 2023-06-24 00:20:58,596 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.01 vs. limit=15.0 2023-06-24 00:21:32,302 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.50 vs. 
limit=15.0 2023-06-24 00:21:40,020 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=981378.0, ans=0.0 2023-06-24 00:21:45,010 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=981438.0, ans=0.025 2023-06-24 00:21:45,967 INFO [train.py:996] (0/4) Epoch 6, batch 11100, loss[loss=0.2028, simple_loss=0.2804, pruned_loss=0.06262, over 21412.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.289, pruned_loss=0.07344, over 4276438.21 frames. ], batch size: 211, lr: 5.16e-03, grad_scale: 16.0 2023-06-24 00:22:12,282 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=981438.0, ans=0.125 2023-06-24 00:22:56,929 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=981558.0, ans=0.025 2023-06-24 00:23:23,906 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.037e+02 2.487e+02 2.801e+02 3.244e+02 5.802e+02, threshold=5.603e+02, percent-clipped=1.0 2023-06-24 00:23:36,133 INFO [train.py:996] (0/4) Epoch 6, batch 11150, loss[loss=0.1582, simple_loss=0.2248, pruned_loss=0.0458, over 16037.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.287, pruned_loss=0.07297, over 4244703.29 frames. ], batch size: 61, lr: 5.16e-03, grad_scale: 16.0 2023-06-24 00:24:17,351 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.05 vs. limit=10.0 2023-06-24 00:24:23,939 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=981798.0, ans=0.125 2023-06-24 00:24:24,589 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.29 vs. limit=10.0 2023-06-24 00:24:34,463 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=981858.0, ans=0.0 2023-06-24 00:25:10,210 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=981978.0, ans=0.125 2023-06-24 00:25:19,383 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.53 vs. limit=22.5 2023-06-24 00:25:27,176 INFO [train.py:996] (0/4) Epoch 6, batch 11200, loss[loss=0.217, simple_loss=0.287, pruned_loss=0.07351, over 21746.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2867, pruned_loss=0.07237, over 4254826.75 frames. 
], batch size: 351, lr: 5.16e-03, grad_scale: 32.0 2023-06-24 00:26:09,323 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=982098.0, ans=0.0 2023-06-24 00:26:17,796 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=982158.0, ans=0.125 2023-06-24 00:26:43,297 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=982218.0, ans=0.1 2023-06-24 00:26:46,894 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=982218.0, ans=0.125 2023-06-24 00:27:03,176 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.056e+02 2.434e+02 2.676e+02 2.972e+02 5.122e+02, threshold=5.353e+02, percent-clipped=0.0 2023-06-24 00:27:12,297 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=982278.0, ans=0.125 2023-06-24 00:27:15,147 INFO [train.py:996] (0/4) Epoch 6, batch 11250, loss[loss=0.2504, simple_loss=0.3096, pruned_loss=0.0956, over 21571.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2867, pruned_loss=0.07237, over 4252695.37 frames. ], batch size: 508, lr: 5.16e-03, grad_scale: 32.0 2023-06-24 00:27:43,855 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=982398.0, ans=0.125 2023-06-24 00:28:30,023 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=982518.0, ans=0.125 2023-06-24 00:28:30,515 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.26 vs. limit=22.5 2023-06-24 00:28:37,173 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 00:29:03,621 INFO [train.py:996] (0/4) Epoch 6, batch 11300, loss[loss=0.171, simple_loss=0.2461, pruned_loss=0.04795, over 17002.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2887, pruned_loss=0.07293, over 4262260.66 frames. ], batch size: 63, lr: 5.16e-03, grad_scale: 16.0 2023-06-24 00:29:10,907 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.91 vs. limit=15.0 2023-06-24 00:30:43,985 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.722e+02 2.481e+02 2.716e+02 3.096e+02 3.979e+02, threshold=5.433e+02, percent-clipped=0.0 2023-06-24 00:31:01,011 INFO [train.py:996] (0/4) Epoch 6, batch 11350, loss[loss=0.2293, simple_loss=0.3058, pruned_loss=0.07643, over 21246.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2889, pruned_loss=0.07208, over 4267557.26 frames. ], batch size: 143, lr: 5.16e-03, grad_scale: 16.0 2023-06-24 00:31:33,480 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.57 vs. limit=15.0 2023-06-24 00:31:40,971 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.67 vs. 
limit=15.0 2023-06-24 00:31:41,737 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=982998.0, ans=0.125 2023-06-24 00:32:57,405 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.05 vs. limit=6.0 2023-06-24 00:32:58,534 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=983238.0, ans=0.125 2023-06-24 00:32:59,515 INFO [train.py:996] (0/4) Epoch 6, batch 11400, loss[loss=0.1863, simple_loss=0.2512, pruned_loss=0.06066, over 16128.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.2955, pruned_loss=0.07532, over 4266966.23 frames. ], batch size: 60, lr: 5.16e-03, grad_scale: 16.0 2023-06-24 00:33:24,049 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.28 vs. limit=15.0 2023-06-24 00:34:38,922 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.986e+02 2.559e+02 2.841e+02 3.332e+02 5.224e+02, threshold=5.682e+02, percent-clipped=0.0 2023-06-24 00:34:49,726 INFO [train.py:996] (0/4) Epoch 6, batch 11450, loss[loss=0.2443, simple_loss=0.3396, pruned_loss=0.07448, over 21731.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.2981, pruned_loss=0.07484, over 4272534.14 frames. ], batch size: 415, lr: 5.16e-03, grad_scale: 16.0 2023-06-24 00:35:13,789 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=983598.0, ans=0.0 2023-06-24 00:35:49,336 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=983658.0, ans=0.025 2023-06-24 00:36:34,338 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=983778.0, ans=0.0 2023-06-24 00:36:46,012 INFO [train.py:996] (0/4) Epoch 6, batch 11500, loss[loss=0.2394, simple_loss=0.3213, pruned_loss=0.07874, over 21615.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3007, pruned_loss=0.07611, over 4275682.62 frames. ], batch size: 414, lr: 5.16e-03, grad_scale: 16.0 2023-06-24 00:37:08,189 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=983898.0, ans=0.1 2023-06-24 00:37:08,752 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.52 vs. limit=10.0 2023-06-24 00:37:38,262 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-164000.pt 2023-06-24 00:38:21,043 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=984078.0, ans=0.125 2023-06-24 00:38:21,814 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.94 vs. limit=22.5 2023-06-24 00:38:29,590 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.054e+02 2.699e+02 3.055e+02 3.965e+02 5.631e+02, threshold=6.111e+02, percent-clipped=0.0 2023-06-24 00:38:41,229 INFO [train.py:996] (0/4) Epoch 6, batch 11550, loss[loss=0.2396, simple_loss=0.3193, pruned_loss=0.07991, over 21286.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.3049, pruned_loss=0.07506, over 4280035.30 frames. 
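The checkpoint.py entry above ("Saving checkpoint to zipformer/exp_L_small/checkpoint-164000.pt") is a batch-indexed save that fires periodically during the epoch. The helper below is a hypothetical sketch of that behaviour; the function name, argument list, and save interval are assumptions for illustration, not the actual checkpoint.py API.

    from pathlib import Path
    import torch

    def maybe_save_batch_checkpoint(model, optimizer, batch_idx_train,
                                    exp_dir=Path("zipformer/exp_L_small"),
                                    save_every_n=4000):
        # Hypothetical periodic save: whenever the global batch counter hits a
        # multiple of save_every_n, write a batch-indexed checkpoint such as
        # exp_dir / "checkpoint-164000.pt".
        if batch_idx_train == 0 or batch_idx_train % save_every_n != 0:
            return None
        ckpt = {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "batch_idx_train": batch_idx_train,
        }
        path = exp_dir / f"checkpoint-{batch_idx_train}.pt"
        torch.save(ckpt, path)
        return path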
], batch size: 176, lr: 5.16e-03, grad_scale: 16.0 2023-06-24 00:39:22,545 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=984258.0, ans=0.04949747468305833 2023-06-24 00:39:22,560 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=984258.0, ans=0.125 2023-06-24 00:40:13,261 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=984318.0, ans=0.0 2023-06-24 00:40:38,812 INFO [train.py:996] (0/4) Epoch 6, batch 11600, loss[loss=0.2253, simple_loss=0.3167, pruned_loss=0.06697, over 21530.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3203, pruned_loss=0.07815, over 4276392.18 frames. ], batch size: 131, lr: 5.15e-03, grad_scale: 32.0 2023-06-24 00:40:48,509 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=984438.0, ans=0.0 2023-06-24 00:41:41,903 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=984618.0, ans=0.125 2023-06-24 00:41:43,504 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=984618.0, ans=0.1 2023-06-24 00:41:54,480 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=984618.0, ans=0.125 2023-06-24 00:42:07,058 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=984678.0, ans=0.1 2023-06-24 00:42:09,237 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=984678.0, ans=0.125 2023-06-24 00:42:12,459 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=984678.0, ans=0.125 2023-06-24 00:42:15,290 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.062e+02 2.879e+02 3.402e+02 4.224e+02 8.565e+02, threshold=6.804e+02, percent-clipped=5.0 2023-06-24 00:42:28,799 INFO [train.py:996] (0/4) Epoch 6, batch 11650, loss[loss=0.2749, simple_loss=0.373, pruned_loss=0.08845, over 21886.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3245, pruned_loss=0.07816, over 4274363.62 frames. ], batch size: 317, lr: 5.15e-03, grad_scale: 16.0 2023-06-24 00:42:57,208 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=984798.0, ans=0.1 2023-06-24 00:44:12,079 INFO [train.py:996] (0/4) Epoch 6, batch 11700, loss[loss=0.1948, simple_loss=0.2594, pruned_loss=0.06514, over 21589.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3162, pruned_loss=0.07811, over 4270859.45 frames. ], batch size: 263, lr: 5.15e-03, grad_scale: 16.0 2023-06-24 00:44:15,081 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.09 vs. 
limit=15.0 2023-06-24 00:45:06,931 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=985158.0, ans=0.2 2023-06-24 00:45:52,495 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.098e+02 2.525e+02 2.747e+02 3.370e+02 5.066e+02, threshold=5.494e+02, percent-clipped=0.0 2023-06-24 00:46:01,424 INFO [train.py:996] (0/4) Epoch 6, batch 11750, loss[loss=0.2239, simple_loss=0.2795, pruned_loss=0.08418, over 21297.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3073, pruned_loss=0.07742, over 4272548.04 frames. ], batch size: 144, lr: 5.15e-03, grad_scale: 16.0 2023-06-24 00:47:51,441 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=985638.0, ans=0.125 2023-06-24 00:47:52,514 INFO [train.py:996] (0/4) Epoch 6, batch 11800, loss[loss=0.22, simple_loss=0.3163, pruned_loss=0.06186, over 21722.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3078, pruned_loss=0.0787, over 4275578.01 frames. ], batch size: 298, lr: 5.15e-03, grad_scale: 16.0 2023-06-24 00:47:53,247 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=985638.0, ans=0.125 2023-06-24 00:48:19,017 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 00:48:35,248 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.49 vs. limit=15.0 2023-06-24 00:49:34,927 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.881e+02 2.469e+02 2.710e+02 3.084e+02 4.949e+02, threshold=5.420e+02, percent-clipped=0.0 2023-06-24 00:49:43,234 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn2.whiten.whitening_limit, batch_count=985938.0, ans=22.5 2023-06-24 00:49:43,736 INFO [train.py:996] (0/4) Epoch 6, batch 11850, loss[loss=0.2213, simple_loss=0.3288, pruned_loss=0.05691, over 20773.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3102, pruned_loss=0.07837, over 4280906.08 frames. ], batch size: 608, lr: 5.15e-03, grad_scale: 16.0 2023-06-24 00:50:01,231 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=985938.0, ans=0.125 2023-06-24 00:50:06,782 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=985998.0, ans=0.0 2023-06-24 00:50:42,063 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=986058.0, ans=0.04949747468305833 2023-06-24 00:51:34,360 INFO [train.py:996] (0/4) Epoch 6, batch 11900, loss[loss=0.2302, simple_loss=0.3078, pruned_loss=0.07636, over 21662.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3102, pruned_loss=0.07638, over 4276893.95 frames. 
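The optim.py "Clipping_scale=2.0, grad-norm quartiles ... threshold=..., percent-clipped=..." entries summarise the recent distribution of global gradient norms. The sketch below shows one way such a report could be produced: take the min/25%/50%/75%/max of a window of recent norms, set the threshold to Clipping_scale times the median, and count how many norms exceeded it. This is an assumed reconstruction rather than the actual optim.py code, though threshold = 2.0 x median does match the values logged above (e.g. 2 x 2.747e+02 ≈ 5.494e+02).

    import torch

    def grad_norm_report(recent_norms, clipping_scale=2.0):
        # recent_norms: a window of recent global gradient norms (floats).
        t = torch.tensor(recent_norms)
        quartiles = torch.quantile(t, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = clipping_scale * quartiles[2]          # 2.0 x median
        percent_clipped = 100.0 * (t > threshold).float().mean()
        return quartiles, threshold, percent_clipped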
], batch size: 332, lr: 5.15e-03, grad_scale: 16.0 2023-06-24 00:52:14,609 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=986298.0, ans=0.0 2023-06-24 00:52:18,242 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=986298.0, ans=0.1 2023-06-24 00:52:55,257 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=986418.0, ans=0.2 2023-06-24 00:53:16,515 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.796e+02 2.327e+02 2.667e+02 3.121e+02 4.121e+02, threshold=5.333e+02, percent-clipped=0.0 2023-06-24 00:53:31,107 INFO [train.py:996] (0/4) Epoch 6, batch 11950, loss[loss=0.2206, simple_loss=0.3207, pruned_loss=0.06022, over 21692.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3131, pruned_loss=0.07405, over 4271155.21 frames. ], batch size: 414, lr: 5.15e-03, grad_scale: 16.0 2023-06-24 00:53:31,660 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=986538.0, ans=0.0 2023-06-24 00:53:37,428 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.53 vs. limit=15.0 2023-06-24 00:54:28,075 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.86 vs. limit=22.5 2023-06-24 00:54:28,252 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.90 vs. limit=10.0 2023-06-24 00:55:10,172 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.71 vs. limit=15.0 2023-06-24 00:55:19,968 INFO [train.py:996] (0/4) Epoch 6, batch 12000, loss[loss=0.2147, simple_loss=0.277, pruned_loss=0.07619, over 21225.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3068, pruned_loss=0.07217, over 4273967.80 frames. ], batch size: 144, lr: 5.15e-03, grad_scale: 32.0 2023-06-24 00:55:19,969 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-24 00:55:44,729 INFO [train.py:1028] (0/4) Epoch 6, validation: loss=0.2624, simple_loss=0.3526, pruned_loss=0.08607, over 1796401.00 frames. 2023-06-24 00:55:44,730 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23616MB 2023-06-24 00:56:02,455 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 00:56:09,360 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=986898.0, ans=0.2 2023-06-24 00:56:38,721 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.98 vs. limit=12.0 2023-06-24 00:56:49,088 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.91 vs. limit=5.0 2023-06-24 00:57:13,315 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.21 vs. 
limit=15.0 2023-06-24 00:57:13,630 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.670e+02 2.572e+02 3.062e+02 3.583e+02 6.186e+02, threshold=6.124e+02, percent-clipped=1.0 2023-06-24 00:57:14,320 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=987078.0, ans=0.025 2023-06-24 00:57:25,164 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.20 vs. limit=22.5 2023-06-24 00:57:27,290 INFO [train.py:996] (0/4) Epoch 6, batch 12050, loss[loss=0.2144, simple_loss=0.2823, pruned_loss=0.07323, over 21863.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3032, pruned_loss=0.07372, over 4279713.25 frames. ], batch size: 351, lr: 5.15e-03, grad_scale: 32.0 2023-06-24 00:57:27,666 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=987138.0, ans=0.125 2023-06-24 00:57:42,423 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 00:57:48,520 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.29 vs. limit=12.0 2023-06-24 00:58:37,357 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=987318.0, ans=0.125 2023-06-24 00:59:24,257 INFO [train.py:996] (0/4) Epoch 6, batch 12100, loss[loss=0.2417, simple_loss=0.3183, pruned_loss=0.08252, over 21642.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3075, pruned_loss=0.07815, over 4285452.23 frames. ], batch size: 230, lr: 5.15e-03, grad_scale: 16.0 2023-06-24 01:00:00,318 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=987498.0, ans=0.2 2023-06-24 01:00:58,650 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=987678.0, ans=0.125 2023-06-24 01:01:06,906 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.880e+02 2.682e+02 3.113e+02 3.706e+02 5.999e+02, threshold=6.227e+02, percent-clipped=0.0 2023-06-24 01:01:14,022 INFO [train.py:996] (0/4) Epoch 6, batch 12150, loss[loss=0.2517, simple_loss=0.3518, pruned_loss=0.07576, over 21295.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3097, pruned_loss=0.07682, over 4273460.78 frames. ], batch size: 548, lr: 5.15e-03, grad_scale: 16.0 2023-06-24 01:01:14,748 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=987738.0, ans=0.0 2023-06-24 01:01:14,750 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=987738.0, ans=0.125 2023-06-24 01:01:37,092 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.03 vs. 
limit=22.5 2023-06-24 01:02:03,307 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=987858.0, ans=0.0 2023-06-24 01:02:15,462 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=987858.0, ans=0.0 2023-06-24 01:02:17,174 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=987858.0, ans=0.125 2023-06-24 01:02:47,544 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.81 vs. limit=15.0 2023-06-24 01:02:55,336 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=987978.0, ans=0.2 2023-06-24 01:03:08,589 INFO [train.py:996] (0/4) Epoch 6, batch 12200, loss[loss=0.205, simple_loss=0.2684, pruned_loss=0.07077, over 21491.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3047, pruned_loss=0.07546, over 4271982.98 frames. ], batch size: 391, lr: 5.15e-03, grad_scale: 16.0 2023-06-24 01:03:35,645 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=988098.0, ans=0.1 2023-06-24 01:03:52,253 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=988158.0, ans=0.125 2023-06-24 01:03:55,522 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=988158.0, ans=0.125 2023-06-24 01:04:39,378 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.63 vs. limit=15.0 2023-06-24 01:04:45,487 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.529e+02 2.375e+02 2.667e+02 3.386e+02 5.475e+02, threshold=5.334e+02, percent-clipped=0.0 2023-06-24 01:04:57,300 INFO [train.py:996] (0/4) Epoch 6, batch 12250, loss[loss=0.1956, simple_loss=0.2757, pruned_loss=0.05777, over 21489.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2978, pruned_loss=0.07201, over 4261399.93 frames. ], batch size: 471, lr: 5.14e-03, grad_scale: 16.0 2023-06-24 01:05:04,619 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=988338.0, ans=0.0 2023-06-24 01:05:08,004 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=988338.0, ans=0.125 2023-06-24 01:05:09,862 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=988338.0, ans=0.125 2023-06-24 01:05:47,532 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.11 vs. limit=15.0 2023-06-24 01:06:04,235 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 01:06:40,999 INFO [train.py:996] (0/4) Epoch 6, batch 12300, loss[loss=0.186, simple_loss=0.2665, pruned_loss=0.05277, over 21216.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2909, pruned_loss=0.06696, over 4257666.53 frames. ], batch size: 176, lr: 5.14e-03, grad_scale: 8.0 2023-06-24 01:06:56,329 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.67 vs. 
limit=15.0 2023-06-24 01:07:20,689 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.05 vs. limit=22.5 2023-06-24 01:07:40,303 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=988758.0, ans=15.0 2023-06-24 01:08:15,484 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=988878.0, ans=0.0 2023-06-24 01:08:17,376 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=988878.0, ans=0.0 2023-06-24 01:08:25,516 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.495e+02 2.150e+02 2.660e+02 3.179e+02 5.593e+02, threshold=5.319e+02, percent-clipped=1.0 2023-06-24 01:08:36,005 INFO [train.py:996] (0/4) Epoch 6, batch 12350, loss[loss=0.2286, simple_loss=0.3097, pruned_loss=0.07373, over 21834.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2965, pruned_loss=0.06784, over 4261191.91 frames. ], batch size: 118, lr: 5.14e-03, grad_scale: 8.0 2023-06-24 01:08:45,068 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=988938.0, ans=0.125 2023-06-24 01:09:06,940 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.15 vs. limit=10.0 2023-06-24 01:09:23,873 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=989058.0, ans=0.0 2023-06-24 01:09:43,783 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=989118.0, ans=0.1 2023-06-24 01:09:49,152 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=989118.0, ans=0.2 2023-06-24 01:10:21,531 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=989178.0, ans=0.125 2023-06-24 01:10:24,593 INFO [train.py:996] (0/4) Epoch 6, batch 12400, loss[loss=0.228, simple_loss=0.3036, pruned_loss=0.07626, over 21895.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2982, pruned_loss=0.07146, over 4271618.86 frames. ], batch size: 124, lr: 5.14e-03, grad_scale: 16.0 2023-06-24 01:11:37,075 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.12 vs. limit=15.0 2023-06-24 01:12:08,942 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.082e+02 2.631e+02 2.949e+02 3.533e+02 4.721e+02, threshold=5.899e+02, percent-clipped=0.0 2023-06-24 01:12:14,225 INFO [train.py:996] (0/4) Epoch 6, batch 12450, loss[loss=0.2583, simple_loss=0.3236, pruned_loss=0.09644, over 21378.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.3014, pruned_loss=0.07445, over 4272426.92 frames. ], batch size: 548, lr: 5.14e-03, grad_scale: 16.0 2023-06-24 01:13:36,242 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=989718.0, ans=0.0 2023-06-24 01:13:52,955 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.05 vs. 
limit=22.5 2023-06-24 01:14:08,584 INFO [train.py:996] (0/4) Epoch 6, batch 12500, loss[loss=0.2421, simple_loss=0.3312, pruned_loss=0.07654, over 21471.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.314, pruned_loss=0.07835, over 4279722.90 frames. ], batch size: 131, lr: 5.14e-03, grad_scale: 16.0 2023-06-24 01:14:58,121 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=989898.0, ans=0.2 2023-06-24 01:15:01,914 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=989958.0, ans=0.125 2023-06-24 01:15:14,856 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=989958.0, ans=0.0 2023-06-24 01:16:01,943 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.301e+02 2.735e+02 3.011e+02 3.446e+02 4.823e+02, threshold=6.021e+02, percent-clipped=0.0 2023-06-24 01:16:07,452 INFO [train.py:996] (0/4) Epoch 6, batch 12550, loss[loss=0.3146, simple_loss=0.3736, pruned_loss=0.1278, over 21323.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3189, pruned_loss=0.08069, over 4276266.93 frames. ], batch size: 507, lr: 5.14e-03, grad_scale: 16.0 2023-06-24 01:16:24,956 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.93 vs. limit=10.0 2023-06-24 01:16:27,798 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=990138.0, ans=0.0 2023-06-24 01:16:35,122 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=990198.0, ans=0.0 2023-06-24 01:16:40,460 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=990198.0, ans=0.015 2023-06-24 01:18:03,127 INFO [train.py:996] (0/4) Epoch 6, batch 12600, loss[loss=0.1921, simple_loss=0.2874, pruned_loss=0.04838, over 21671.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3162, pruned_loss=0.07826, over 4279349.72 frames. ], batch size: 247, lr: 5.14e-03, grad_scale: 16.0 2023-06-24 01:18:14,572 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=990438.0, ans=0.125 2023-06-24 01:18:49,361 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=990558.0, ans=0.125 2023-06-24 01:18:54,996 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.01 vs. limit=15.0 2023-06-24 01:19:11,995 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=990618.0, ans=0.125 2023-06-24 01:19:26,808 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.28 vs. limit=22.5 2023-06-24 01:19:46,575 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.745e+02 2.342e+02 2.712e+02 3.358e+02 5.513e+02, threshold=5.424e+02, percent-clipped=0.0 2023-06-24 01:19:51,708 INFO [train.py:996] (0/4) Epoch 6, batch 12650, loss[loss=0.24, simple_loss=0.3055, pruned_loss=0.0873, over 21879.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3088, pruned_loss=0.07454, over 4282251.55 frames. 
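The very frequent scaling.py "ScheduledFloat: name=..., batch_count=..., ans=..." entries record module hyperparameters (dropout rates, skip rates, balancer probabilities and so on) whose current value ("ans") is a function of the global batch count. The class below is a minimal, assumed sketch of such a batch-count-keyed schedule using piecewise-linear interpolation; the class name and the example schedule points are hypothetical, not the definitions in scaling.py.

    import bisect

    class ScheduledValue:
        """Hypothetical piecewise-linear schedule keyed on the batch count."""

        def __init__(self, *points):
            # points: (batch_count, value) pairs, e.g. (0, 0.3), (20000, 0.1)
            self.points = sorted(points)

        def value_at(self, batch_count):
            xs = [p[0] for p in self.points]
            i = bisect.bisect_right(xs, batch_count)
            if i == 0:
                return self.points[0][1]
            if i == len(self.points):
                return self.points[-1][1]
            (x0, y0), (x1, y1) = self.points[i - 1], self.points[i]
            frac = (batch_count - x0) / (x1 - x0)
            return y0 + frac * (y1 - y0)

    # e.g. a dropout rate that decays from 0.3 to 0.1 over the first 20k batches
    dropout_p = ScheduledValue((0.0, 0.3), (20000.0, 0.1))
    print(dropout_p.value_at(990138.0))  # well past the last point -> 0.1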
], batch size: 124, lr: 5.14e-03, grad_scale: 16.0 2023-06-24 01:20:35,177 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=990858.0, ans=0.0 2023-06-24 01:21:28,799 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=990978.0, ans=0.125 2023-06-24 01:21:40,700 INFO [train.py:996] (0/4) Epoch 6, batch 12700, loss[loss=0.3003, simple_loss=0.3472, pruned_loss=0.1267, over 21538.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3085, pruned_loss=0.07648, over 4287937.45 frames. ], batch size: 471, lr: 5.14e-03, grad_scale: 16.0 2023-06-24 01:21:45,116 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 01:21:56,677 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.12 vs. limit=10.0 2023-06-24 01:22:26,714 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=991158.0, ans=0.125 2023-06-24 01:23:25,533 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.805e+02 2.607e+02 2.938e+02 3.445e+02 5.217e+02, threshold=5.876e+02, percent-clipped=0.0 2023-06-24 01:23:28,250 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=991278.0, ans=0.125 2023-06-24 01:23:31,066 INFO [train.py:996] (0/4) Epoch 6, batch 12750, loss[loss=0.245, simple_loss=0.3174, pruned_loss=0.08628, over 19883.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3102, pruned_loss=0.0772, over 4277710.16 frames. ], batch size: 702, lr: 5.14e-03, grad_scale: 16.0 2023-06-24 01:23:40,018 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=991338.0, ans=0.2 2023-06-24 01:24:52,101 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=991518.0, ans=0.2 2023-06-24 01:25:10,224 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.40 vs. limit=22.5 2023-06-24 01:25:18,811 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=991638.0, ans=0.125 2023-06-24 01:25:19,754 INFO [train.py:996] (0/4) Epoch 6, batch 12800, loss[loss=0.2721, simple_loss=0.3425, pruned_loss=0.1008, over 21444.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3106, pruned_loss=0.07831, over 4286384.14 frames. ], batch size: 176, lr: 5.14e-03, grad_scale: 32.0 2023-06-24 01:25:22,275 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=991638.0, ans=0.2 2023-06-24 01:25:47,112 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.57 vs. limit=22.5 2023-06-24 01:25:55,652 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.57 vs. 
limit=22.5 2023-06-24 01:26:03,507 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=991758.0, ans=0.2 2023-06-24 01:26:03,590 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=991758.0, ans=0.2 2023-06-24 01:26:10,787 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=991758.0, ans=0.0 2023-06-24 01:26:55,069 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=991878.0, ans=0.125 2023-06-24 01:26:58,324 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=991878.0, ans=10.0 2023-06-24 01:27:00,710 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=991878.0, ans=0.04949747468305833 2023-06-24 01:27:06,618 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.019e+02 2.498e+02 2.671e+02 3.042e+02 5.514e+02, threshold=5.341e+02, percent-clipped=0.0 2023-06-24 01:27:10,344 INFO [train.py:996] (0/4) Epoch 6, batch 12850, loss[loss=0.2069, simple_loss=0.3013, pruned_loss=0.05626, over 21899.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3122, pruned_loss=0.07972, over 4288941.36 frames. ], batch size: 316, lr: 5.14e-03, grad_scale: 16.0 2023-06-24 01:27:29,196 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.27 vs. limit=15.0 2023-06-24 01:28:07,502 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=992058.0, ans=0.1 2023-06-24 01:28:30,805 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=992118.0, ans=0.1 2023-06-24 01:28:49,302 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=992178.0, ans=0.125 2023-06-24 01:29:02,153 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.66 vs. limit=22.5 2023-06-24 01:29:08,030 INFO [train.py:996] (0/4) Epoch 6, batch 12900, loss[loss=0.1977, simple_loss=0.2834, pruned_loss=0.05604, over 21597.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3097, pruned_loss=0.07583, over 4282116.28 frames. 
], batch size: 263, lr: 5.13e-03, grad_scale: 16.0 2023-06-24 01:29:26,356 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=992298.0, ans=0.035 2023-06-24 01:29:45,231 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=992298.0, ans=0.125 2023-06-24 01:29:50,646 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=992298.0, ans=0.0 2023-06-24 01:30:22,307 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=992418.0, ans=0.0 2023-06-24 01:30:34,502 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=992478.0, ans=0.0 2023-06-24 01:30:47,595 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.64 vs. limit=15.0 2023-06-24 01:30:55,031 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.691e+02 2.252e+02 2.502e+02 2.973e+02 5.465e+02, threshold=5.003e+02, percent-clipped=1.0 2023-06-24 01:30:58,574 INFO [train.py:996] (0/4) Epoch 6, batch 12950, loss[loss=0.1884, simple_loss=0.2764, pruned_loss=0.05021, over 21744.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.307, pruned_loss=0.07392, over 4276376.41 frames. ], batch size: 332, lr: 5.13e-03, grad_scale: 16.0 2023-06-24 01:31:13,109 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=992538.0, ans=0.0 2023-06-24 01:32:04,002 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.33 vs. limit=15.0 2023-06-24 01:32:08,882 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=992718.0, ans=0.1 2023-06-24 01:32:14,053 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff2.min_abs, batch_count=992718.0, ans=0.1 2023-06-24 01:32:23,350 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.99 vs. limit=15.0 2023-06-24 01:32:47,516 INFO [train.py:996] (0/4) Epoch 6, batch 13000, loss[loss=0.1911, simple_loss=0.2726, pruned_loss=0.05477, over 21684.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.3058, pruned_loss=0.07371, over 4275587.77 frames. ], batch size: 263, lr: 5.13e-03, grad_scale: 16.0 2023-06-24 01:33:00,516 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=992838.0, ans=0.125 2023-06-24 01:33:09,571 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=992898.0, ans=0.125 2023-06-24 01:33:11,234 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=992898.0, ans=0.125 2023-06-24 01:33:52,248 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.60 vs. 
limit=15.0 2023-06-24 01:34:25,258 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=993078.0, ans=0.125 2023-06-24 01:34:33,612 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.751e+02 2.511e+02 2.962e+02 3.599e+02 5.386e+02, threshold=5.923e+02, percent-clipped=1.0 2023-06-24 01:34:36,951 INFO [train.py:996] (0/4) Epoch 6, batch 13050, loss[loss=0.2708, simple_loss=0.3241, pruned_loss=0.1088, over 21639.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.3017, pruned_loss=0.07173, over 4268342.37 frames. ], batch size: 471, lr: 5.13e-03, grad_scale: 16.0 2023-06-24 01:35:15,771 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=993198.0, ans=0.0 2023-06-24 01:35:55,358 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=993318.0, ans=0.035 2023-06-24 01:36:08,614 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.93 vs. limit=15.0 2023-06-24 01:36:21,561 INFO [train.py:996] (0/4) Epoch 6, batch 13100, loss[loss=0.279, simple_loss=0.3524, pruned_loss=0.1028, over 21175.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.3042, pruned_loss=0.07128, over 4277825.29 frames. ], batch size: 143, lr: 5.13e-03, grad_scale: 16.0 2023-06-24 01:36:39,861 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=993438.0, ans=0.07 2023-06-24 01:36:48,162 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=993438.0, ans=0.125 2023-06-24 01:36:49,624 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=993498.0, ans=0.1 2023-06-24 01:37:19,898 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=993558.0, ans=0.0 2023-06-24 01:37:42,908 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=993618.0, ans=0.125 2023-06-24 01:37:42,986 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=993618.0, ans=0.1 2023-06-24 01:38:09,184 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.779e+02 2.775e+02 3.249e+02 4.198e+02 6.182e+02, threshold=6.497e+02, percent-clipped=2.0 2023-06-24 01:38:18,936 INFO [train.py:996] (0/4) Epoch 6, batch 13150, loss[loss=0.3288, simple_loss=0.4541, pruned_loss=0.1017, over 19788.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3085, pruned_loss=0.07453, over 4275199.22 frames. 
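The scaling.py "Whitening" entries compare a per-module statistic of the activation covariance ("metric") against a limit, e.g. "metric=3.93 vs. limit=15.0" above. The function below illustrates the general idea with a stand-in metric (largest covariance eigenvalue over the mean eigenvalue, which is 1.0 for perfectly whitened features); the exact formula used by scaling.py is not reproduced here, so treat this purely as an assumed illustration.

    import torch

    def whitening_metric(x):
        # x: (num_frames, num_channels) activations.
        # Illustrative metric only: ratio of the largest eigenvalue of the
        # feature covariance to the mean eigenvalue.  Features with equal
        # variance in every direction give a value close to 1; the training
        # code logs / penalises modules whose metric exceeds the limit.
        x = x - x.mean(dim=0, keepdim=True)
        cov = (x.t() @ x) / x.shape[0]
        eigs = torch.linalg.eigvalsh(cov)
        return eigs.max() / eigs.mean().clamp(min=1e-20)

    feats = torch.randn(1000, 256)          # roughly whitened random features
    print(f"metric={whitening_metric(feats):.2f} vs. limit=15.0")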
], batch size: 702, lr: 5.13e-03, grad_scale: 16.0 2023-06-24 01:38:37,239 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=993738.0, ans=0.0 2023-06-24 01:38:53,254 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=993798.0, ans=0.125 2023-06-24 01:38:55,230 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=993798.0, ans=0.0 2023-06-24 01:39:17,169 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.56 vs. limit=15.0 2023-06-24 01:39:46,507 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=993978.0, ans=0.0 2023-06-24 01:40:09,789 INFO [train.py:996] (0/4) Epoch 6, batch 13200, loss[loss=0.2394, simple_loss=0.3051, pruned_loss=0.08687, over 21289.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3069, pruned_loss=0.07542, over 4275370.08 frames. ], batch size: 176, lr: 5.13e-03, grad_scale: 32.0 2023-06-24 01:40:24,326 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=994038.0, ans=0.09899494936611666 2023-06-24 01:40:38,575 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=994098.0, ans=0.0 2023-06-24 01:41:21,148 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=994218.0, ans=0.2 2023-06-24 01:41:48,524 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.85 vs. limit=22.5 2023-06-24 01:41:56,117 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.076e+02 2.674e+02 2.987e+02 3.685e+02 5.841e+02, threshold=5.974e+02, percent-clipped=0.0 2023-06-24 01:41:59,678 INFO [train.py:996] (0/4) Epoch 6, batch 13250, loss[loss=0.2159, simple_loss=0.2993, pruned_loss=0.0663, over 21525.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3065, pruned_loss=0.07708, over 4276115.30 frames. ], batch size: 131, lr: 5.13e-03, grad_scale: 32.0 2023-06-24 01:42:19,909 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=994398.0, ans=0.0 2023-06-24 01:43:18,132 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=994518.0, ans=0.0 2023-06-24 01:43:32,890 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=994578.0, ans=0.0 2023-06-24 01:43:40,461 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.65 vs. limit=12.0 2023-06-24 01:43:49,666 INFO [train.py:996] (0/4) Epoch 6, batch 13300, loss[loss=0.2477, simple_loss=0.333, pruned_loss=0.0812, over 21637.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3099, pruned_loss=0.07755, over 4271555.53 frames. 
], batch size: 389, lr: 5.13e-03, grad_scale: 32.0 2023-06-24 01:44:08,179 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=994638.0, ans=0.125 2023-06-24 01:44:34,538 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.61 vs. limit=15.0 2023-06-24 01:45:17,602 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=994818.0, ans=0.95 2023-06-24 01:45:28,314 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=994878.0, ans=0.2 2023-06-24 01:45:30,218 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=994878.0, ans=0.05 2023-06-24 01:45:41,749 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.934e+02 2.520e+02 2.865e+02 3.222e+02 4.480e+02, threshold=5.730e+02, percent-clipped=0.0 2023-06-24 01:45:41,780 INFO [train.py:996] (0/4) Epoch 6, batch 13350, loss[loss=0.238, simple_loss=0.3219, pruned_loss=0.07703, over 21665.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3136, pruned_loss=0.07975, over 4276606.72 frames. ], batch size: 351, lr: 5.13e-03, grad_scale: 8.0 2023-06-24 01:45:42,940 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=994938.0, ans=15.0 2023-06-24 01:45:58,508 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=994938.0, ans=0.0 2023-06-24 01:46:30,548 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=995058.0, ans=0.0 2023-06-24 01:46:56,358 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.64 vs. limit=22.5 2023-06-24 01:47:32,605 INFO [train.py:996] (0/4) Epoch 6, batch 13400, loss[loss=0.2351, simple_loss=0.3592, pruned_loss=0.05546, over 19788.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3153, pruned_loss=0.08016, over 4271426.41 frames. 
], batch size: 702, lr: 5.13e-03, grad_scale: 8.0 2023-06-24 01:47:36,717 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=995238.0, ans=0.1 2023-06-24 01:47:40,382 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=995238.0, ans=0.2 2023-06-24 01:47:50,695 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=995298.0, ans=0.0 2023-06-24 01:48:00,675 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=995298.0, ans=0.0 2023-06-24 01:48:07,347 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=995298.0, ans=0.0 2023-06-24 01:48:11,032 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=995298.0, ans=0.2 2023-06-24 01:48:20,440 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=995358.0, ans=0.0 2023-06-24 01:48:42,539 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=995418.0, ans=0.125 2023-06-24 01:48:56,855 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=995418.0, ans=0.2 2023-06-24 01:49:23,335 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.257e+02 2.783e+02 3.072e+02 3.557e+02 5.639e+02, threshold=6.143e+02, percent-clipped=0.0 2023-06-24 01:49:23,380 INFO [train.py:996] (0/4) Epoch 6, batch 13450, loss[loss=0.259, simple_loss=0.3325, pruned_loss=0.09281, over 21531.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3169, pruned_loss=0.08218, over 4273600.07 frames. ], batch size: 131, lr: 5.13e-03, grad_scale: 8.0 2023-06-24 01:49:32,865 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=995538.0, ans=0.1 2023-06-24 01:49:32,920 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=995538.0, ans=0.125 2023-06-24 01:49:42,038 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=995538.0, ans=0.125 2023-06-24 01:50:15,772 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=995658.0, ans=0.2 2023-06-24 01:51:13,906 INFO [train.py:996] (0/4) Epoch 6, batch 13500, loss[loss=0.2711, simple_loss=0.3425, pruned_loss=0.09989, over 21698.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3041, pruned_loss=0.0788, over 4261374.93 frames. 
], batch size: 441, lr: 5.13e-03, grad_scale: 8.0 2023-06-24 01:51:21,700 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 01:51:29,219 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=995838.0, ans=0.0 2023-06-24 01:51:49,237 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=995898.0, ans=0.125 2023-06-24 01:52:13,122 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=995958.0, ans=0.125 2023-06-24 01:52:41,859 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=996078.0, ans=0.125 2023-06-24 01:53:06,792 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.929e+02 2.607e+02 3.013e+02 3.630e+02 7.011e+02, threshold=6.026e+02, percent-clipped=1.0 2023-06-24 01:53:06,826 INFO [train.py:996] (0/4) Epoch 6, batch 13550, loss[loss=0.2423, simple_loss=0.333, pruned_loss=0.07578, over 20770.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3073, pruned_loss=0.07795, over 4261179.56 frames. ], batch size: 607, lr: 5.12e-03, grad_scale: 8.0 2023-06-24 01:53:13,919 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.79 vs. limit=22.5 2023-06-24 01:53:22,092 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=996138.0, ans=0.1 2023-06-24 01:53:40,474 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=996198.0, ans=0.125 2023-06-24 01:53:59,638 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=996258.0, ans=0.0 2023-06-24 01:54:17,220 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=996318.0, ans=0.125 2023-06-24 01:54:32,353 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=996378.0, ans=0.125 2023-06-24 01:54:36,414 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.37 vs. limit=10.0 2023-06-24 01:54:52,430 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=996378.0, ans=0.125 2023-06-24 01:54:57,332 INFO [train.py:996] (0/4) Epoch 6, batch 13600, loss[loss=0.2962, simple_loss=0.345, pruned_loss=0.1238, over 21685.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3082, pruned_loss=0.07838, over 4268075.66 frames. 
], batch size: 507, lr: 5.12e-03, grad_scale: 16.0 2023-06-24 01:55:22,867 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 01:56:07,055 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=996618.0, ans=0.125 2023-06-24 01:56:14,204 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=996618.0, ans=0.2 2023-06-24 01:56:35,657 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=996678.0, ans=0.1 2023-06-24 01:56:46,165 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=996738.0, ans=0.0 2023-06-24 01:56:47,195 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.983e+02 2.489e+02 2.780e+02 3.135e+02 6.333e+02, threshold=5.560e+02, percent-clipped=1.0 2023-06-24 01:56:47,228 INFO [train.py:996] (0/4) Epoch 6, batch 13650, loss[loss=0.2027, simple_loss=0.2644, pruned_loss=0.07051, over 21540.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3049, pruned_loss=0.07626, over 4271494.32 frames. ], batch size: 414, lr: 5.12e-03, grad_scale: 16.0 2023-06-24 01:57:24,002 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.72 vs. limit=6.0 2023-06-24 01:57:43,078 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=996858.0, ans=0.125 2023-06-24 01:58:37,420 INFO [train.py:996] (0/4) Epoch 6, batch 13700, loss[loss=0.2632, simple_loss=0.3394, pruned_loss=0.0935, over 21645.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.2995, pruned_loss=0.07565, over 4270700.36 frames. ], batch size: 441, lr: 5.12e-03, grad_scale: 16.0 2023-06-24 01:58:42,782 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.78 vs. limit=8.0 2023-06-24 01:59:04,952 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=13.00 vs. limit=15.0 2023-06-24 01:59:08,595 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.38 vs. 
limit=15.0 2023-06-24 01:59:09,675 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=997098.0, ans=0.125 2023-06-24 01:59:13,435 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=997098.0, ans=0.125 2023-06-24 01:59:43,997 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=997158.0, ans=0.125 2023-06-24 02:00:05,561 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=997218.0, ans=0.2 2023-06-24 02:00:16,788 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=997278.0, ans=0.1 2023-06-24 02:00:33,152 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=997278.0, ans=0.0 2023-06-24 02:00:35,430 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.76 vs. limit=15.0 2023-06-24 02:00:41,407 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.183e+02 2.702e+02 3.112e+02 3.506e+02 5.710e+02, threshold=6.223e+02, percent-clipped=1.0 2023-06-24 02:00:41,437 INFO [train.py:996] (0/4) Epoch 6, batch 13750, loss[loss=0.1913, simple_loss=0.2464, pruned_loss=0.0681, over 21140.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.2977, pruned_loss=0.07451, over 4258537.19 frames. ], batch size: 176, lr: 5.12e-03, grad_scale: 16.0 2023-06-24 02:00:57,544 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.79 vs. limit=22.5 2023-06-24 02:01:47,470 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 02:01:47,518 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=997518.0, ans=0.1 2023-06-24 02:02:16,897 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=997578.0, ans=0.1 2023-06-24 02:02:30,754 INFO [train.py:996] (0/4) Epoch 6, batch 13800, loss[loss=0.2346, simple_loss=0.3328, pruned_loss=0.06818, over 21614.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3039, pruned_loss=0.07435, over 4255549.16 frames. ], batch size: 263, lr: 5.12e-03, grad_scale: 16.0 2023-06-24 02:04:01,881 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=997818.0, ans=0.125 2023-06-24 02:04:01,893 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=997818.0, ans=0.125 2023-06-24 02:04:12,987 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=997878.0, ans=0.125 2023-06-24 02:04:22,872 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.067e+02 2.948e+02 3.505e+02 4.086e+02 7.226e+02, threshold=7.009e+02, percent-clipped=3.0 2023-06-24 02:04:22,903 INFO [train.py:996] (0/4) Epoch 6, batch 13850, loss[loss=0.2827, simple_loss=0.3637, pruned_loss=0.1009, over 21615.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3079, pruned_loss=0.07445, over 4254688.24 frames. 
], batch size: 414, lr: 5.12e-03, grad_scale: 16.0 2023-06-24 02:04:34,032 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=997938.0, ans=0.1 2023-06-24 02:05:31,567 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 02:05:31,725 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=998058.0, ans=0.2 2023-06-24 02:05:47,369 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=998118.0, ans=0.2 2023-06-24 02:05:49,049 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=998118.0, ans=0.2 2023-06-24 02:06:17,516 INFO [train.py:996] (0/4) Epoch 6, batch 13900, loss[loss=0.2587, simple_loss=0.3254, pruned_loss=0.09602, over 21806.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3125, pruned_loss=0.07884, over 4259466.31 frames. ], batch size: 414, lr: 5.12e-03, grad_scale: 16.0 2023-06-24 02:06:29,071 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=998238.0, ans=0.0 2023-06-24 02:07:31,260 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.76 vs. limit=15.0 2023-06-24 02:07:38,124 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.14 vs. limit=22.5 2023-06-24 02:08:02,159 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=998478.0, ans=0.125 2023-06-24 02:08:08,422 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.130e+02 2.809e+02 3.184e+02 3.702e+02 5.147e+02, threshold=6.368e+02, percent-clipped=0.0 2023-06-24 02:08:08,453 INFO [train.py:996] (0/4) Epoch 6, batch 13950, loss[loss=0.2523, simple_loss=0.3228, pruned_loss=0.09093, over 21367.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3132, pruned_loss=0.07984, over 4263419.62 frames. ], batch size: 144, lr: 5.12e-03, grad_scale: 16.0 2023-06-24 02:08:27,706 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=998538.0, ans=0.125 2023-06-24 02:08:56,704 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=998658.0, ans=0.0 2023-06-24 02:09:04,866 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=998658.0, ans=0.1 2023-06-24 02:09:08,065 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=998658.0, ans=0.0 2023-06-24 02:09:57,214 INFO [train.py:996] (0/4) Epoch 6, batch 14000, loss[loss=0.2021, simple_loss=0.291, pruned_loss=0.05659, over 21627.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3101, pruned_loss=0.07765, over 4272393.80 frames. 
], batch size: 263, lr: 5.12e-03, grad_scale: 32.0 2023-06-24 02:10:22,871 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=998898.0, ans=0.0 2023-06-24 02:11:43,077 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=999078.0, ans=0.1 2023-06-24 02:11:46,173 INFO [train.py:996] (0/4) Epoch 6, batch 14050, loss[loss=0.1813, simple_loss=0.2603, pruned_loss=0.05118, over 21422.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.3049, pruned_loss=0.07363, over 4269829.57 frames. ], batch size: 194, lr: 5.12e-03, grad_scale: 16.0 2023-06-24 02:11:47,712 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.602e+02 2.313e+02 2.760e+02 3.193e+02 4.998e+02, threshold=5.521e+02, percent-clipped=0.0 2023-06-24 02:13:21,949 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=999378.0, ans=0.1 2023-06-24 02:13:35,289 INFO [train.py:996] (0/4) Epoch 6, batch 14100, loss[loss=0.2392, simple_loss=0.311, pruned_loss=0.08374, over 21734.00 frames. ], tot_loss[loss=0.223, simple_loss=0.2997, pruned_loss=0.07313, over 4267081.57 frames. ], batch size: 351, lr: 5.12e-03, grad_scale: 16.0 2023-06-24 02:13:48,897 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=999438.0, ans=0.1 2023-06-24 02:15:09,675 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=999678.0, ans=0.5 2023-06-24 02:15:14,506 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=999738.0, ans=0.0 2023-06-24 02:15:15,652 INFO [train.py:996] (0/4) Epoch 6, batch 14150, loss[loss=0.2375, simple_loss=0.3185, pruned_loss=0.07824, over 21842.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.3033, pruned_loss=0.07379, over 4256817.96 frames. ], batch size: 107, lr: 5.12e-03, grad_scale: 16.0 2023-06-24 02:15:17,251 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.953e+02 2.422e+02 2.767e+02 3.253e+02 5.449e+02, threshold=5.534e+02, percent-clipped=0.0 2023-06-24 02:15:19,390 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_na.min_abs, batch_count=999738.0, ans=0.02 2023-06-24 02:16:02,490 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=999858.0, ans=0.2 2023-06-24 02:17:02,004 INFO [train.py:996] (0/4) Epoch 6, batch 14200, loss[loss=0.1964, simple_loss=0.2648, pruned_loss=0.06402, over 21602.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.3007, pruned_loss=0.07201, over 4261533.48 frames. ], batch size: 230, lr: 5.11e-03, grad_scale: 16.0 2023-06-24 02:17:30,022 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1000098.0, ans=0.125 2023-06-24 02:17:41,778 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1000098.0, ans=0.125 2023-06-24 02:18:32,314 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.13 vs. 
limit=15.0 2023-06-24 02:18:37,145 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1000278.0, ans=0.0 2023-06-24 02:18:42,561 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1000278.0, ans=0.125 2023-06-24 02:18:52,100 INFO [train.py:996] (0/4) Epoch 6, batch 14250, loss[loss=0.1957, simple_loss=0.2783, pruned_loss=0.05652, over 21753.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2957, pruned_loss=0.07258, over 4249048.39 frames. ], batch size: 371, lr: 5.11e-03, grad_scale: 16.0 2023-06-24 02:18:53,608 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.840e+02 2.255e+02 2.600e+02 3.105e+02 6.584e+02, threshold=5.199e+02, percent-clipped=1.0 2023-06-24 02:19:40,165 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.19 vs. limit=15.0 2023-06-24 02:19:41,605 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 02:20:44,860 INFO [train.py:996] (0/4) Epoch 6, batch 14300, loss[loss=0.29, simple_loss=0.3875, pruned_loss=0.09626, over 21750.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2942, pruned_loss=0.07152, over 4244090.84 frames. ], batch size: 351, lr: 5.11e-03, grad_scale: 16.0 2023-06-24 02:21:08,506 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1000698.0, ans=0.125 2023-06-24 02:21:30,576 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.40 vs. limit=15.0 2023-06-24 02:22:01,392 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1000818.0, ans=0.0 2023-06-24 02:22:10,351 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1000818.0, ans=0.125 2023-06-24 02:22:13,547 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1000818.0, ans=0.1 2023-06-24 02:22:34,165 INFO [train.py:996] (0/4) Epoch 6, batch 14350, loss[loss=0.2298, simple_loss=0.2846, pruned_loss=0.08748, over 20007.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.2997, pruned_loss=0.07258, over 4247428.66 frames. 
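The [scaling.py:182] lines above log ScheduledFloat values (dropout probabilities, skip rates, scale minima) together with the current batch_count, i.e. hyperparameters whose effective value changes as training progresses. As a guess at the general mechanism only (not the library's actual ScheduledFloat, and with made-up breakpoints), such a batch-count schedule can be expressed as piecewise-linear interpolation:

```python
# Hypothetical sketch of a batch-count-dependent float schedule.  The actual
# ScheduledFloat in the training code may behave differently; the breakpoints
# below are illustrative only.
from bisect import bisect_right

class PiecewiseLinearFloat:
    def __init__(self, points):
        # points: list of (batch_count, value) pairs
        self.points = sorted(points)

    def value(self, batch_count: float) -> float:
        xs = [x for x, _ in self.points]
        i = bisect_right(xs, batch_count)
        if i == 0:
            return self.points[0][1]
        if i == len(self.points):
            return self.points[-1][1]
        (x0, y0), (x1, y1) = self.points[i - 1], self.points[i]
        t = (batch_count - x0) / (x1 - x0)
        return y0 + t * (y1 - y0)

# Example: a dropout probability that decays from 0.3 to 0.1 over training.
dropout_p = PiecewiseLinearFloat([(0, 0.3), (500_000, 0.1)])
print(dropout_p.value(996_618))  # -> 0.1 once past the final breakpoint
```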
], batch size: 703, lr: 5.11e-03, grad_scale: 16.0 2023-06-24 02:22:36,007 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.695e+02 2.573e+02 3.287e+02 4.161e+02 6.824e+02, threshold=6.573e+02, percent-clipped=7.0 2023-06-24 02:22:54,055 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1000938.0, ans=0.125 2023-06-24 02:23:15,047 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1000998.0, ans=0.125 2023-06-24 02:23:26,623 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1001058.0, ans=0.2 2023-06-24 02:23:30,239 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1001058.0, ans=0.0 2023-06-24 02:24:18,008 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1001178.0, ans=0.0 2023-06-24 02:24:25,758 INFO [train.py:996] (0/4) Epoch 6, batch 14400, loss[loss=0.2078, simple_loss=0.2685, pruned_loss=0.07358, over 21581.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.2992, pruned_loss=0.07386, over 4258928.34 frames. ], batch size: 230, lr: 5.11e-03, grad_scale: 32.0 2023-06-24 02:24:26,269 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1001238.0, ans=0.1 2023-06-24 02:24:26,866 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.58 vs. limit=22.5 2023-06-24 02:24:56,235 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1001298.0, ans=0.0 2023-06-24 02:25:35,722 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.77 vs. limit=22.5 2023-06-24 02:25:42,243 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1001418.0, ans=0.125 2023-06-24 02:25:45,903 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1001418.0, ans=0.0 2023-06-24 02:25:49,526 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.52 vs. limit=22.5 2023-06-24 02:26:09,491 INFO [train.py:996] (0/4) Epoch 6, batch 14450, loss[loss=0.2191, simple_loss=0.2747, pruned_loss=0.08172, over 21279.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.2944, pruned_loss=0.07407, over 4264069.64 frames. ], batch size: 176, lr: 5.11e-03, grad_scale: 32.0 2023-06-24 02:26:16,061 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.121e+02 2.443e+02 2.785e+02 3.113e+02 5.962e+02, threshold=5.570e+02, percent-clipped=0.0 2023-06-24 02:27:18,470 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.84 vs. 
limit=15.0 2023-06-24 02:27:38,242 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1001778.0, ans=0.1 2023-06-24 02:27:53,944 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1001778.0, ans=0.0 2023-06-24 02:27:56,678 INFO [train.py:996] (0/4) Epoch 6, batch 14500, loss[loss=0.2278, simple_loss=0.2842, pruned_loss=0.08567, over 21163.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2926, pruned_loss=0.07406, over 4258808.22 frames. ], batch size: 143, lr: 5.11e-03, grad_scale: 32.0 2023-06-24 02:29:01,136 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1001958.0, ans=0.125 2023-06-24 02:29:18,755 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1002018.0, ans=0.2 2023-06-24 02:29:52,404 INFO [train.py:996] (0/4) Epoch 6, batch 14550, loss[loss=0.212, simple_loss=0.3064, pruned_loss=0.05878, over 19773.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2978, pruned_loss=0.07551, over 4257014.48 frames. ], batch size: 702, lr: 5.11e-03, grad_scale: 32.0 2023-06-24 02:30:01,995 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.903e+02 2.448e+02 2.869e+02 3.616e+02 7.079e+02, threshold=5.738e+02, percent-clipped=4.0 2023-06-24 02:30:20,969 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.23 vs. limit=15.0 2023-06-24 02:31:02,954 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1002318.0, ans=0.125 2023-06-24 02:31:06,169 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1002318.0, ans=0.95 2023-06-24 02:31:35,233 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1002378.0, ans=0.1 2023-06-24 02:31:46,402 INFO [train.py:996] (0/4) Epoch 6, batch 14600, loss[loss=0.2369, simple_loss=0.3228, pruned_loss=0.07551, over 21284.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3053, pruned_loss=0.07897, over 4262004.78 frames. ], batch size: 159, lr: 5.11e-03, grad_scale: 16.0 2023-06-24 02:32:02,450 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1002498.0, ans=0.125 2023-06-24 02:32:04,136 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1002498.0, ans=0.125 2023-06-24 02:32:57,777 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1002618.0, ans=0.1 2023-06-24 02:33:15,058 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1002678.0, ans=0.125 2023-06-24 02:33:16,781 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1002678.0, ans=0.0 2023-06-24 02:33:28,146 INFO [train.py:996] (0/4) Epoch 6, batch 14650, loss[loss=0.1819, simple_loss=0.2776, pruned_loss=0.04311, over 21762.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.307, pruned_loss=0.07774, over 4257373.49 frames. 
], batch size: 332, lr: 5.11e-03, grad_scale: 16.0 2023-06-24 02:33:31,375 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.829e+02 2.911e+02 3.568e+02 4.716e+02 7.092e+02, threshold=7.135e+02, percent-clipped=11.0 2023-06-24 02:34:48,850 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1002978.0, ans=0.1 2023-06-24 02:34:50,504 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1002978.0, ans=0.125 2023-06-24 02:35:15,507 INFO [train.py:996] (0/4) Epoch 6, batch 14700, loss[loss=0.247, simple_loss=0.3071, pruned_loss=0.0935, over 20076.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.3006, pruned_loss=0.07297, over 4241853.45 frames. ], batch size: 702, lr: 5.11e-03, grad_scale: 16.0 2023-06-24 02:35:17,895 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1003038.0, ans=0.125 2023-06-24 02:35:17,934 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1003038.0, ans=0.125 2023-06-24 02:35:23,265 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1003038.0, ans=0.0 2023-06-24 02:35:23,958 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=11.19 vs. limit=15.0 2023-06-24 02:35:55,331 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1003098.0, ans=0.1 2023-06-24 02:37:05,512 INFO [train.py:996] (0/4) Epoch 6, batch 14750, loss[loss=0.2979, simple_loss=0.3853, pruned_loss=0.1053, over 21216.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3059, pruned_loss=0.07566, over 4253031.97 frames. ], batch size: 548, lr: 5.11e-03, grad_scale: 16.0 2023-06-24 02:37:08,874 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.655e+02 2.584e+02 3.183e+02 3.769e+02 5.952e+02, threshold=6.365e+02, percent-clipped=0.0 2023-06-24 02:37:32,811 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1003398.0, ans=0.2 2023-06-24 02:37:43,414 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1003398.0, ans=0.125 2023-06-24 02:38:32,493 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.59 vs. limit=12.0 2023-06-24 02:38:59,539 INFO [train.py:996] (0/4) Epoch 6, batch 14800, loss[loss=0.2964, simple_loss=0.3365, pruned_loss=0.1282, over 21334.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3182, pruned_loss=0.08182, over 4264626.07 frames. 
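Each [train.py:996] progress line above pairs the loss of the current batch ("over N frames") with a tot_loss computed over a much larger, slowly growing frame count, which looks like a frame-weighted average over recent batches. The exact bookkeeping is not visible in the log, so the following is only a minimal sketch of that idea, with an assumed decay factor:

```python
# Hypothetical frame-weighted running average, in the spirit of the
# "tot_loss[... over N frames]" entries.  The decay factor is an assumption.
class RunningLoss:
    def __init__(self, decay: float = 0.99):
        self.decay = decay
        self.loss_sum = 0.0   # decayed sum of loss * frames
        self.frames = 0.0     # decayed sum of frames

    def update(self, batch_loss: float, batch_frames: float) -> float:
        self.loss_sum = self.decay * self.loss_sum + batch_loss * batch_frames
        self.frames = self.decay * self.frames + batch_frames
        return self.loss_sum / self.frames  # current tot_loss estimate

tracker = RunningLoss()
print(tracker.update(0.2544, 21639.0))  # first call just returns the batch loss
```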
], batch size: 507, lr: 5.11e-03, grad_scale: 32.0 2023-06-24 02:39:00,074 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1003638.0, ans=0.0 2023-06-24 02:39:28,523 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1003698.0, ans=0.07 2023-06-24 02:39:56,840 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1003758.0, ans=0.09899494936611666 2023-06-24 02:40:55,584 INFO [train.py:996] (0/4) Epoch 6, batch 14850, loss[loss=0.211, simple_loss=0.2796, pruned_loss=0.07122, over 21570.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3117, pruned_loss=0.08089, over 4253255.66 frames. ], batch size: 263, lr: 5.10e-03, grad_scale: 32.0 2023-06-24 02:40:59,019 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.106e+02 2.678e+02 3.116e+02 4.005e+02 6.901e+02, threshold=6.233e+02, percent-clipped=1.0 2023-06-24 02:41:20,677 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.93 vs. limit=15.0 2023-06-24 02:41:35,793 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1004058.0, ans=0.0 2023-06-24 02:42:15,394 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1004118.0, ans=0.125 2023-06-24 02:42:16,241 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=14.91 vs. limit=15.0 2023-06-24 02:42:19,469 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1004118.0, ans=0.125 2023-06-24 02:42:21,094 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1004118.0, ans=0.125 2023-06-24 02:42:34,269 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.75 vs. limit=15.0 2023-06-24 02:42:47,054 INFO [train.py:996] (0/4) Epoch 6, batch 14900, loss[loss=0.2417, simple_loss=0.3156, pruned_loss=0.08386, over 21380.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.314, pruned_loss=0.08151, over 4256459.19 frames. ], batch size: 549, lr: 5.10e-03, grad_scale: 32.0 2023-06-24 02:43:32,009 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1004358.0, ans=0.125 2023-06-24 02:43:40,934 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1004358.0, ans=0.125 2023-06-24 02:44:08,854 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1004418.0, ans=0.125 2023-06-24 02:44:36,519 INFO [train.py:996] (0/4) Epoch 6, batch 14950, loss[loss=0.2485, simple_loss=0.3314, pruned_loss=0.08278, over 21632.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3175, pruned_loss=0.08272, over 4260555.67 frames. 
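The grad_scale value in the progress lines switches between 16.0 and 32.0 over this section, which is consistent with dynamic loss scaling under fp16 training: the scale grows after a run of overflow-free steps and is cut back when infs/nans appear. A generic PyTorch sketch of that pattern (not the training script itself; compute_loss is a placeholder) is:

```python
# Generic mixed-precision step with dynamic loss scaling; illustrative only.
import torch

scaler = torch.cuda.amp.GradScaler(init_scale=16.0)

def train_step(model, optimizer, batch, compute_loss):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = compute_loss(model, batch)
    scaler.scale(loss).backward()
    scaler.step(optimizer)   # skipped internally if inf/nan gradients are found
    scaler.update()          # grows or shrinks the scale
    return loss.detach(), scaler.get_scale()  # get_scale() presumably maps to the logged grad_scale
```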
], batch size: 441, lr: 5.10e-03, grad_scale: 32.0 2023-06-24 02:44:39,950 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.887e+02 2.635e+02 3.010e+02 3.574e+02 5.643e+02, threshold=6.019e+02, percent-clipped=0.0 2023-06-24 02:45:30,009 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1004658.0, ans=0.0 2023-06-24 02:45:34,841 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1004658.0, ans=0.0 2023-06-24 02:45:39,947 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1004658.0, ans=0.125 2023-06-24 02:46:09,966 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_na.min_abs, batch_count=1004778.0, ans=0.02 2023-06-24 02:46:16,915 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1004778.0, ans=0.125 2023-06-24 02:46:24,992 INFO [train.py:996] (0/4) Epoch 6, batch 15000, loss[loss=0.2394, simple_loss=0.3096, pruned_loss=0.08454, over 21394.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3194, pruned_loss=0.08393, over 4268205.43 frames. ], batch size: 176, lr: 5.10e-03, grad_scale: 32.0 2023-06-24 02:46:24,993 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-24 02:46:45,300 INFO [train.py:1028] (0/4) Epoch 6, validation: loss=0.2621, simple_loss=0.3511, pruned_loss=0.08652, over 1796401.00 frames. 2023-06-24 02:46:45,301 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23616MB 2023-06-24 02:46:49,753 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1004838.0, ans=0.0 2023-06-24 02:47:04,584 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1004838.0, ans=0.0 2023-06-24 02:48:16,382 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.06 vs. limit=10.0 2023-06-24 02:48:36,407 INFO [train.py:996] (0/4) Epoch 6, batch 15050, loss[loss=0.265, simple_loss=0.3648, pruned_loss=0.08261, over 21170.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3186, pruned_loss=0.08396, over 4268513.24 frames. ], batch size: 548, lr: 5.10e-03, grad_scale: 32.0 2023-06-24 02:48:45,232 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.077e+02 2.748e+02 3.194e+02 3.808e+02 5.890e+02, threshold=6.387e+02, percent-clipped=0.0 2023-06-24 02:49:11,598 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.55 vs. limit=15.0 2023-06-24 02:50:11,356 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=1005378.0, ans=15.0 2023-06-24 02:50:31,398 INFO [train.py:996] (0/4) Epoch 6, batch 15100, loss[loss=0.2683, simple_loss=0.3363, pruned_loss=0.1001, over 21316.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3212, pruned_loss=0.08362, over 4276507.14 frames. 
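At batch 15000 the log above shows the script pausing to compute a validation loss over the dev set and then reporting the peak GPU memory allocated so far. A minimal sketch of that pattern, with placeholder model/dataloader/loss names, could look like this:

```python
# Hypothetical periodic validation pass; model, dev_loader and compute_loss
# are placeholders, not the script's actual objects.
import torch

@torch.no_grad()
def compute_validation_loss(model, dev_loader, compute_loss, device="cuda:0"):
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    for batch in dev_loader:
        loss, num_frames = compute_loss(model, batch)  # per-batch loss and frame count
        tot_loss += float(loss) * num_frames
        tot_frames += num_frames
    model.train()
    max_mem_mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
    print(f"validation: loss={tot_loss / tot_frames:.4f}, "
          f"over {tot_frames:.2f} frames; max memory {max_mem_mb}MB")
```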
], batch size: 548, lr: 5.10e-03, grad_scale: 32.0 2023-06-24 02:50:51,919 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1005438.0, ans=0.125 2023-06-24 02:51:41,392 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1005618.0, ans=0.125 2023-06-24 02:51:56,566 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1005678.0, ans=0.125 2023-06-24 02:52:20,514 INFO [train.py:996] (0/4) Epoch 6, batch 15150, loss[loss=0.2022, simple_loss=0.2717, pruned_loss=0.06634, over 21742.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.318, pruned_loss=0.08443, over 4275778.67 frames. ], batch size: 112, lr: 5.10e-03, grad_scale: 32.0 2023-06-24 02:52:29,946 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.002e+02 2.489e+02 2.718e+02 3.127e+02 6.231e+02, threshold=5.435e+02, percent-clipped=0.0 2023-06-24 02:52:52,235 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.73 vs. limit=15.0 2023-06-24 02:53:40,246 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.02 vs. limit=15.0 2023-06-24 02:54:14,607 INFO [train.py:996] (0/4) Epoch 6, batch 15200, loss[loss=0.1938, simple_loss=0.2777, pruned_loss=0.05491, over 21646.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3085, pruned_loss=0.07983, over 4275497.66 frames. ], batch size: 247, lr: 5.10e-03, grad_scale: 32.0 2023-06-24 02:54:17,357 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.27 vs. limit=10.0 2023-06-24 02:54:56,475 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1006158.0, ans=0.0 2023-06-24 02:55:22,584 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1006218.0, ans=0.0 2023-06-24 02:56:03,338 INFO [train.py:996] (0/4) Epoch 6, batch 15250, loss[loss=0.2045, simple_loss=0.2698, pruned_loss=0.06956, over 21683.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3024, pruned_loss=0.07787, over 4274421.24 frames. ], batch size: 333, lr: 5.10e-03, grad_scale: 16.0 2023-06-24 02:56:13,795 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.617e+02 2.536e+02 2.850e+02 3.419e+02 5.207e+02, threshold=5.701e+02, percent-clipped=0.0 2023-06-24 02:56:27,221 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1006398.0, ans=0.125 2023-06-24 02:56:28,681 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1006398.0, ans=0.0 2023-06-24 02:56:52,496 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1006458.0, ans=0.125 2023-06-24 02:57:06,442 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1006518.0, ans=0.0 2023-06-24 02:57:33,326 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=6.92 vs. 
limit=22.5 2023-06-24 02:57:34,784 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1006578.0, ans=0.125 2023-06-24 02:57:58,585 INFO [train.py:996] (0/4) Epoch 6, batch 15300, loss[loss=0.2465, simple_loss=0.3177, pruned_loss=0.08762, over 21987.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3046, pruned_loss=0.07944, over 4266733.75 frames. ], batch size: 317, lr: 5.10e-03, grad_scale: 16.0 2023-06-24 02:58:33,226 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1006698.0, ans=0.0 2023-06-24 02:59:48,119 INFO [train.py:996] (0/4) Epoch 6, batch 15350, loss[loss=0.2184, simple_loss=0.3239, pruned_loss=0.05646, over 21779.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.309, pruned_loss=0.08146, over 4274983.30 frames. ], batch size: 247, lr: 5.10e-03, grad_scale: 16.0 2023-06-24 02:59:52,936 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.904e+02 2.681e+02 3.062e+02 3.788e+02 5.909e+02, threshold=6.124e+02, percent-clipped=1.0 2023-06-24 03:00:41,888 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.28 vs. limit=15.0 2023-06-24 03:01:07,883 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.30 vs. limit=22.5 2023-06-24 03:01:17,608 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1007178.0, ans=0.0 2023-06-24 03:01:23,791 INFO [train.py:996] (0/4) Epoch 6, batch 15400, loss[loss=0.2347, simple_loss=0.3111, pruned_loss=0.07909, over 21877.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3097, pruned_loss=0.08024, over 4278864.67 frames. ], batch size: 351, lr: 5.10e-03, grad_scale: 16.0 2023-06-24 03:01:53,989 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1007298.0, ans=0.0 2023-06-24 03:01:54,066 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1007298.0, ans=0.125 2023-06-24 03:02:16,655 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=1007358.0, ans=15.0 2023-06-24 03:02:23,361 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1007358.0, ans=0.2 2023-06-24 03:03:00,709 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1007478.0, ans=0.125 2023-06-24 03:03:12,816 INFO [train.py:996] (0/4) Epoch 6, batch 15450, loss[loss=0.2026, simple_loss=0.2714, pruned_loss=0.06687, over 21678.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3074, pruned_loss=0.07924, over 4275791.96 frames. ], batch size: 263, lr: 5.10e-03, grad_scale: 16.0 2023-06-24 03:03:23,433 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.018e+02 2.379e+02 2.689e+02 3.180e+02 6.204e+02, threshold=5.379e+02, percent-clipped=1.0 2023-06-24 03:05:07,258 INFO [train.py:996] (0/4) Epoch 6, batch 15500, loss[loss=0.252, simple_loss=0.3217, pruned_loss=0.09111, over 21591.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3107, pruned_loss=0.07846, over 4267008.50 frames. 
], batch size: 263, lr: 5.09e-03, grad_scale: 16.0 2023-06-24 03:05:17,602 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.64 vs. limit=22.5 2023-06-24 03:05:29,250 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1007898.0, ans=0.125 2023-06-24 03:06:04,035 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-168000.pt 2023-06-24 03:06:39,677 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.11 vs. limit=22.5 2023-06-24 03:06:58,632 INFO [train.py:996] (0/4) Epoch 6, batch 15550, loss[loss=0.2834, simple_loss=0.3989, pruned_loss=0.08396, over 19755.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3112, pruned_loss=0.07726, over 4260691.79 frames. ], batch size: 702, lr: 5.09e-03, grad_scale: 16.0 2023-06-24 03:07:03,918 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.019e+02 2.505e+02 2.792e+02 3.296e+02 4.983e+02, threshold=5.584e+02, percent-clipped=0.0 2023-06-24 03:07:21,301 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1008198.0, ans=0.2 2023-06-24 03:08:00,440 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1008318.0, ans=0.0 2023-06-24 03:08:46,218 INFO [train.py:996] (0/4) Epoch 6, batch 15600, loss[loss=0.2064, simple_loss=0.2744, pruned_loss=0.06918, over 21920.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3042, pruned_loss=0.07534, over 4265139.81 frames. ], batch size: 125, lr: 5.09e-03, grad_scale: 32.0 2023-06-24 03:09:10,715 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1008498.0, ans=10.0 2023-06-24 03:10:33,937 INFO [train.py:996] (0/4) Epoch 6, batch 15650, loss[loss=0.2107, simple_loss=0.2797, pruned_loss=0.07086, over 21625.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.3027, pruned_loss=0.07527, over 4261177.46 frames. ], batch size: 332, lr: 5.09e-03, grad_scale: 32.0 2023-06-24 03:10:39,278 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.960e+02 2.465e+02 2.724e+02 3.048e+02 4.286e+02, threshold=5.447e+02, percent-clipped=0.0 2023-06-24 03:11:20,485 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1008858.0, ans=0.0 2023-06-24 03:11:49,807 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1008918.0, ans=0.1 2023-06-24 03:11:54,813 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1008918.0, ans=0.1 2023-06-24 03:12:21,553 INFO [train.py:996] (0/4) Epoch 6, batch 15700, loss[loss=0.2044, simple_loss=0.2809, pruned_loss=0.064, over 21839.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.2979, pruned_loss=0.07427, over 4269001.68 frames. ], batch size: 372, lr: 5.09e-03, grad_scale: 32.0 2023-06-24 03:13:03,917 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1009098.0, ans=0.125 2023-06-24 03:13:24,166 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.08 vs. 
limit=5.0 2023-06-24 03:13:43,269 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.80 vs. limit=15.0 2023-06-24 03:14:00,916 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1009278.0, ans=0.125 2023-06-24 03:14:08,905 INFO [train.py:996] (0/4) Epoch 6, batch 15750, loss[loss=0.1999, simple_loss=0.2734, pruned_loss=0.0632, over 21480.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2948, pruned_loss=0.07464, over 4263639.14 frames. ], batch size: 212, lr: 5.09e-03, grad_scale: 32.0 2023-06-24 03:14:14,106 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.808e+02 2.454e+02 2.677e+02 3.133e+02 4.467e+02, threshold=5.354e+02, percent-clipped=0.0 2023-06-24 03:14:23,984 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.48 vs. limit=6.0 2023-06-24 03:14:25,515 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.96 vs. limit=22.5 2023-06-24 03:15:02,155 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1009458.0, ans=0.125 2023-06-24 03:15:45,071 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=1009578.0, ans=22.5 2023-06-24 03:15:57,566 INFO [train.py:996] (0/4) Epoch 6, batch 15800, loss[loss=0.2244, simple_loss=0.329, pruned_loss=0.05991, over 20790.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2902, pruned_loss=0.07409, over 4270783.56 frames. ], batch size: 608, lr: 5.09e-03, grad_scale: 32.0 2023-06-24 03:16:02,164 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1009638.0, ans=0.0 2023-06-24 03:16:12,422 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 03:17:08,438 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1009818.0, ans=0.2 2023-06-24 03:17:45,449 INFO [train.py:996] (0/4) Epoch 6, batch 15850, loss[loss=0.2468, simple_loss=0.3189, pruned_loss=0.08739, over 21261.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.2935, pruned_loss=0.07695, over 4275645.37 frames. ], batch size: 143, lr: 5.09e-03, grad_scale: 32.0 2023-06-24 03:17:46,689 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.17 vs. 
limit=15.0 2023-06-24 03:17:50,485 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.171e+02 2.697e+02 2.988e+02 3.672e+02 5.659e+02, threshold=5.976e+02, percent-clipped=2.0 2023-06-24 03:17:56,213 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1009938.0, ans=0.1 2023-06-24 03:18:32,651 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1010058.0, ans=0.125 2023-06-24 03:18:41,363 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1010058.0, ans=0.0 2023-06-24 03:19:32,102 INFO [train.py:996] (0/4) Epoch 6, batch 15900, loss[loss=0.2294, simple_loss=0.2975, pruned_loss=0.08061, over 21468.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2891, pruned_loss=0.07607, over 4263945.99 frames. ], batch size: 389, lr: 5.09e-03, grad_scale: 32.0 2023-06-24 03:19:45,114 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1010238.0, ans=0.125 2023-06-24 03:20:28,575 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1010358.0, ans=0.0 2023-06-24 03:21:04,860 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1010478.0, ans=10.0 2023-06-24 03:21:19,563 INFO [train.py:996] (0/4) Epoch 6, batch 15950, loss[loss=0.2011, simple_loss=0.2901, pruned_loss=0.05603, over 21500.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2885, pruned_loss=0.07335, over 4260537.28 frames. ], batch size: 471, lr: 5.09e-03, grad_scale: 32.0 2023-06-24 03:21:24,433 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.544e+02 2.251e+02 2.569e+02 3.023e+02 4.641e+02, threshold=5.138e+02, percent-clipped=0.0 2023-06-24 03:21:29,303 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.88 vs. limit=15.0 2023-06-24 03:21:42,953 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=1010598.0, ans=22.5 2023-06-24 03:21:51,680 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1010598.0, ans=0.0 2023-06-24 03:22:20,962 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1010718.0, ans=0.0 2023-06-24 03:22:29,541 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1010718.0, ans=0.1 2023-06-24 03:22:29,604 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1010718.0, ans=0.125 2023-06-24 03:22:37,393 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.37 vs. limit=10.0 2023-06-24 03:22:39,705 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1010718.0, ans=0.125 2023-06-24 03:22:48,986 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.29 vs. 
limit=22.5 2023-06-24 03:23:07,205 INFO [train.py:996] (0/4) Epoch 6, batch 16000, loss[loss=0.1901, simple_loss=0.2768, pruned_loss=0.05167, over 21676.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2896, pruned_loss=0.07137, over 4266646.16 frames. ], batch size: 263, lr: 5.09e-03, grad_scale: 32.0 2023-06-24 03:23:07,761 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1010838.0, ans=0.0 2023-06-24 03:23:08,357 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.08 vs. limit=15.0 2023-06-24 03:24:25,575 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1011018.0, ans=0.1 2023-06-24 03:24:27,666 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1011018.0, ans=0.0 2023-06-24 03:24:27,671 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1011018.0, ans=0.125 2023-06-24 03:24:30,934 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1011018.0, ans=0.125 2023-06-24 03:24:53,225 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1011078.0, ans=0.125 2023-06-24 03:24:55,885 INFO [train.py:996] (0/4) Epoch 6, batch 16050, loss[loss=0.2735, simple_loss=0.3784, pruned_loss=0.08429, over 21259.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2932, pruned_loss=0.06998, over 4267648.72 frames. ], batch size: 548, lr: 5.09e-03, grad_scale: 16.0 2023-06-24 03:25:02,638 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.897e+02 2.499e+02 2.877e+02 3.627e+02 5.675e+02, threshold=5.753e+02, percent-clipped=3.0 2023-06-24 03:25:03,327 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1011138.0, ans=0.0 2023-06-24 03:25:49,616 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1011258.0, ans=0.0 2023-06-24 03:26:42,164 INFO [train.py:996] (0/4) Epoch 6, batch 16100, loss[loss=0.2147, simple_loss=0.2824, pruned_loss=0.07352, over 21907.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.2999, pruned_loss=0.07174, over 4270909.52 frames. 
], batch size: 316, lr: 5.09e-03, grad_scale: 16.0 2023-06-24 03:27:01,922 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1011498.0, ans=0.1 2023-06-24 03:27:57,588 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1011618.0, ans=0.1 2023-06-24 03:28:08,448 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1011678.0, ans=0.0 2023-06-24 03:28:13,772 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1011678.0, ans=0.0 2023-06-24 03:28:13,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1011678.0, ans=0.0 2023-06-24 03:28:23,745 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1011678.0, ans=0.2 2023-06-24 03:28:24,293 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.14 vs. limit=22.5 2023-06-24 03:28:31,502 INFO [train.py:996] (0/4) Epoch 6, batch 16150, loss[loss=0.2241, simple_loss=0.3366, pruned_loss=0.05577, over 19868.00 frames. ], tot_loss[loss=0.225, simple_loss=0.3017, pruned_loss=0.07417, over 4276689.42 frames. ], batch size: 702, lr: 5.08e-03, grad_scale: 16.0 2023-06-24 03:28:38,461 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.889e+02 2.535e+02 2.977e+02 3.474e+02 6.271e+02, threshold=5.955e+02, percent-clipped=2.0 2023-06-24 03:29:22,136 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1011858.0, ans=0.2 2023-06-24 03:29:22,186 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 03:29:46,731 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1011918.0, ans=0.04949747468305833 2023-06-24 03:30:10,179 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.75 vs. limit=10.0 2023-06-24 03:30:21,152 INFO [train.py:996] (0/4) Epoch 6, batch 16200, loss[loss=0.2271, simple_loss=0.3026, pruned_loss=0.07578, over 21289.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3046, pruned_loss=0.07553, over 4287861.98 frames. ], batch size: 159, lr: 5.08e-03, grad_scale: 16.0 2023-06-24 03:31:06,697 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.61 vs. limit=22.5 2023-06-24 03:32:09,583 INFO [train.py:996] (0/4) Epoch 6, batch 16250, loss[loss=0.1774, simple_loss=0.252, pruned_loss=0.05141, over 21251.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3052, pruned_loss=0.07637, over 4292574.61 frames. ], batch size: 176, lr: 5.08e-03, grad_scale: 16.0 2023-06-24 03:32:16,274 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.059e+02 2.579e+02 2.975e+02 3.411e+02 5.928e+02, threshold=5.950e+02, percent-clipped=0.0 2023-06-24 03:32:28,220 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.35 vs. 
limit=15.0 2023-06-24 03:32:52,649 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1012398.0, ans=0.2 2023-06-24 03:33:02,962 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.17 vs. limit=12.0 2023-06-24 03:33:57,767 INFO [train.py:996] (0/4) Epoch 6, batch 16300, loss[loss=0.2354, simple_loss=0.3208, pruned_loss=0.07505, over 20705.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2988, pruned_loss=0.07237, over 4282902.80 frames. ], batch size: 607, lr: 5.08e-03, grad_scale: 16.0 2023-06-24 03:35:45,294 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1012878.0, ans=0.125 2023-06-24 03:35:47,989 INFO [train.py:996] (0/4) Epoch 6, batch 16350, loss[loss=0.3036, simple_loss=0.3651, pruned_loss=0.1211, over 21402.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.2994, pruned_loss=0.07344, over 4280045.51 frames. ], batch size: 471, lr: 5.08e-03, grad_scale: 16.0 2023-06-24 03:36:00,043 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.811e+02 2.290e+02 2.661e+02 3.043e+02 4.876e+02, threshold=5.321e+02, percent-clipped=0.0 2023-06-24 03:36:02,906 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.04 vs. limit=6.0 2023-06-24 03:36:16,781 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1012998.0, ans=0.2 2023-06-24 03:36:29,206 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1012998.0, ans=0.0 2023-06-24 03:36:45,320 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1013058.0, ans=0.125 2023-06-24 03:36:48,890 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1013058.0, ans=0.2 2023-06-24 03:37:36,544 INFO [train.py:996] (0/4) Epoch 6, batch 16400, loss[loss=0.2233, simple_loss=0.2938, pruned_loss=0.07639, over 21913.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.3026, pruned_loss=0.07519, over 4279413.27 frames. ], batch size: 107, lr: 5.08e-03, grad_scale: 32.0 2023-06-24 03:37:51,159 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1013238.0, ans=0.1 2023-06-24 03:38:17,601 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1013298.0, ans=0.125 2023-06-24 03:38:18,462 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.20 vs. limit=5.0 2023-06-24 03:38:42,048 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1013358.0, ans=0.0 2023-06-24 03:38:58,310 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.33 vs. limit=15.0 2023-06-24 03:39:30,146 INFO [train.py:996] (0/4) Epoch 6, batch 16450, loss[loss=0.2298, simple_loss=0.3011, pruned_loss=0.07926, over 21876.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3019, pruned_loss=0.07574, over 4288973.99 frames. 
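The [scaling.py:962] Whitening lines report a per-module metric that is compared against a limit (e.g. "metric=4.04 vs. limit=6.0"). One plausible reading, offered only as an assumption about what such a metric measures rather than the library's actual definition, is a ratio describing how far the feature covariance is from a multiple of the identity, equal to 1.0 for perfectly decorrelated ("white") activations:

```python
# Hypothetical whitening metric: distance of a feature covariance from a
# multiple of the identity (1.0 = perfectly "white").  This is only a plausible
# reading of the "metric=... vs. limit=..." lines, not the library's code.
import torch

def whitening_metric(x: torch.Tensor) -> float:
    # x: (num_frames, num_channels) activations
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.t() @ x) / x.shape[0]
    eigs = torch.linalg.eigvalsh(cov)
    return float((eigs ** 2).mean() / eigs.mean() ** 2)

x = torch.randn(1000, 256)   # nearly white features
print(whitening_metric(x))   # close to 1.0, well under a limit like 15.0
```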
], batch size: 371, lr: 5.08e-03, grad_scale: 32.0 2023-06-24 03:39:42,781 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.970e+02 2.477e+02 2.722e+02 3.151e+02 4.827e+02, threshold=5.443e+02, percent-clipped=0.0 2023-06-24 03:40:30,433 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.31 vs. limit=15.0 2023-06-24 03:40:38,884 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1013718.0, ans=0.125 2023-06-24 03:41:26,349 INFO [train.py:996] (0/4) Epoch 6, batch 16500, loss[loss=0.1871, simple_loss=0.2547, pruned_loss=0.05973, over 21447.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3011, pruned_loss=0.07582, over 4285137.80 frames. ], batch size: 211, lr: 5.08e-03, grad_scale: 32.0 2023-06-24 03:41:28,790 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1013838.0, ans=0.0 2023-06-24 03:41:30,326 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=1013838.0, ans=0.05 2023-06-24 03:41:34,032 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1013838.0, ans=0.125 2023-06-24 03:42:06,980 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 03:43:15,533 INFO [train.py:996] (0/4) Epoch 6, batch 16550, loss[loss=0.2302, simple_loss=0.3148, pruned_loss=0.07275, over 21700.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.2958, pruned_loss=0.07288, over 4282986.82 frames. ], batch size: 351, lr: 5.08e-03, grad_scale: 32.0 2023-06-24 03:43:20,304 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.10 vs. limit=10.0 2023-06-24 03:43:22,449 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.907e+02 2.591e+02 3.154e+02 3.856e+02 7.253e+02, threshold=6.309e+02, percent-clipped=4.0 2023-06-24 03:43:30,153 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1014138.0, ans=0.2 2023-06-24 03:44:14,506 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.17 vs. limit=6.0 2023-06-24 03:45:06,835 INFO [train.py:996] (0/4) Epoch 6, batch 16600, loss[loss=0.2583, simple_loss=0.3629, pruned_loss=0.07683, over 21557.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3061, pruned_loss=0.07632, over 4285760.21 frames. 
], batch size: 230, lr: 5.08e-03, grad_scale: 32.0 2023-06-24 03:45:40,320 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1014498.0, ans=0.2 2023-06-24 03:45:57,968 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1014558.0, ans=0.5 2023-06-24 03:46:11,686 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1014558.0, ans=0.125 2023-06-24 03:46:40,265 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1014678.0, ans=0.125 2023-06-24 03:46:56,044 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.31 vs. limit=15.0 2023-06-24 03:47:02,049 INFO [train.py:996] (0/4) Epoch 6, batch 16650, loss[loss=0.2644, simple_loss=0.339, pruned_loss=0.09489, over 21526.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3149, pruned_loss=0.0786, over 4281949.87 frames. ], batch size: 211, lr: 5.08e-03, grad_scale: 32.0 2023-06-24 03:47:04,539 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1014738.0, ans=0.125 2023-06-24 03:47:14,437 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.134e+02 2.632e+02 2.959e+02 3.254e+02 5.416e+02, threshold=5.917e+02, percent-clipped=0.0 2023-06-24 03:48:15,664 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1014918.0, ans=0.0 2023-06-24 03:48:22,042 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.00 vs. limit=22.5 2023-06-24 03:48:46,449 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1014978.0, ans=0.125 2023-06-24 03:48:59,635 INFO [train.py:996] (0/4) Epoch 6, batch 16700, loss[loss=0.1943, simple_loss=0.2538, pruned_loss=0.06737, over 21902.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3168, pruned_loss=0.07968, over 4278064.90 frames. ], batch size: 98, lr: 5.08e-03, grad_scale: 32.0 2023-06-24 03:49:00,765 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1015038.0, ans=10.0 2023-06-24 03:49:18,289 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.55 vs. limit=15.0 2023-06-24 03:49:37,749 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1015098.0, ans=0.125 2023-06-24 03:49:47,120 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 03:50:10,167 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=1015218.0, ans=10.0 2023-06-24 03:50:10,299 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1015218.0, ans=0.0 2023-06-24 03:50:58,260 INFO [train.py:996] (0/4) Epoch 6, batch 16750, loss[loss=0.2544, simple_loss=0.3309, pruned_loss=0.08897, over 21639.00 frames. 
], tot_loss[loss=0.2418, simple_loss=0.3192, pruned_loss=0.08214, over 4273310.75 frames. ], batch size: 263, lr: 5.08e-03, grad_scale: 16.0 2023-06-24 03:51:10,302 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 03:51:10,927 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.27 vs. limit=15.0 2023-06-24 03:51:11,147 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.27 vs. limit=10.0 2023-06-24 03:51:13,352 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.212e+02 2.841e+02 3.113e+02 3.878e+02 5.035e+02, threshold=6.225e+02, percent-clipped=0.0 2023-06-24 03:52:01,781 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1015458.0, ans=0.125 2023-06-24 03:52:54,264 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1015638.0, ans=0.125 2023-06-24 03:52:55,214 INFO [train.py:996] (0/4) Epoch 6, batch 16800, loss[loss=0.2246, simple_loss=0.3043, pruned_loss=0.0724, over 21821.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3239, pruned_loss=0.08278, over 4263255.56 frames. ], batch size: 298, lr: 5.08e-03, grad_scale: 32.0 2023-06-24 03:53:09,762 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1015638.0, ans=0.125 2023-06-24 03:53:13,028 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1015698.0, ans=0.0 2023-06-24 03:54:31,782 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.11 vs. limit=15.0 2023-06-24 03:54:32,699 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1015878.0, ans=0.0 2023-06-24 03:54:34,701 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1015878.0, ans=0.125 2023-06-24 03:54:44,500 INFO [train.py:996] (0/4) Epoch 6, batch 16850, loss[loss=0.2133, simple_loss=0.2796, pruned_loss=0.07355, over 21486.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3194, pruned_loss=0.08228, over 4275603.85 frames. ], batch size: 194, lr: 5.07e-03, grad_scale: 32.0 2023-06-24 03:54:53,527 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.099e+02 2.780e+02 3.302e+02 4.313e+02 7.428e+02, threshold=6.605e+02, percent-clipped=4.0 2023-06-24 03:55:12,409 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1015998.0, ans=0.1 2023-06-24 03:55:14,971 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.03 vs. limit=12.0 2023-06-24 03:55:46,937 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.37 vs. 
limit=15.0 2023-06-24 03:55:51,187 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1016118.0, ans=0.125 2023-06-24 03:56:19,579 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1016178.0, ans=0.0 2023-06-24 03:56:32,148 INFO [train.py:996] (0/4) Epoch 6, batch 16900, loss[loss=0.2155, simple_loss=0.2815, pruned_loss=0.07479, over 21559.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3133, pruned_loss=0.07991, over 4271572.51 frames. ], batch size: 441, lr: 5.07e-03, grad_scale: 32.0 2023-06-24 03:57:10,877 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1016298.0, ans=0.125 2023-06-24 03:57:44,836 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1016418.0, ans=0.1 2023-06-24 03:57:57,021 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1016478.0, ans=0.125 2023-06-24 03:58:19,454 INFO [train.py:996] (0/4) Epoch 6, batch 16950, loss[loss=0.2175, simple_loss=0.295, pruned_loss=0.07005, over 21865.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3067, pruned_loss=0.0783, over 4278841.15 frames. ], batch size: 124, lr: 5.07e-03, grad_scale: 16.0 2023-06-24 03:58:29,699 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.847e+02 2.437e+02 2.853e+02 3.182e+02 4.700e+02, threshold=5.707e+02, percent-clipped=0.0 2023-06-24 03:58:51,210 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.44 vs. limit=10.0 2023-06-24 03:59:51,796 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1016778.0, ans=0.125 2023-06-24 04:00:03,698 INFO [train.py:996] (0/4) Epoch 6, batch 17000, loss[loss=0.2073, simple_loss=0.2794, pruned_loss=0.06759, over 21939.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3029, pruned_loss=0.07835, over 4286834.84 frames. ], batch size: 316, lr: 5.07e-03, grad_scale: 16.0 2023-06-24 04:00:11,898 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1016838.0, ans=0.1 2023-06-24 04:00:19,299 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.04 vs. limit=15.0 2023-06-24 04:01:08,340 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1017018.0, ans=0.125 2023-06-24 04:01:26,336 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1017018.0, ans=0.125 2023-06-24 04:01:44,429 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1017078.0, ans=0.125 2023-06-24 04:01:53,043 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1017138.0, ans=0.0 2023-06-24 04:01:54,098 INFO [train.py:996] (0/4) Epoch 6, batch 17050, loss[loss=0.2537, simple_loss=0.3436, pruned_loss=0.08186, over 21388.00 frames. 
], tot_loss[loss=0.236, simple_loss=0.3097, pruned_loss=0.08113, over 4289615.90 frames. ], batch size: 548, lr: 5.07e-03, grad_scale: 16.0 2023-06-24 04:02:04,590 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.138e+02 2.608e+02 3.012e+02 3.512e+02 5.895e+02, threshold=6.025e+02, percent-clipped=1.0 2023-06-24 04:02:12,577 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1017138.0, ans=0.125 2023-06-24 04:02:24,580 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1017198.0, ans=0.125 2023-06-24 04:02:47,010 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1017258.0, ans=0.125 2023-06-24 04:03:14,952 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1017318.0, ans=0.1 2023-06-24 04:03:21,689 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1017378.0, ans=0.125 2023-06-24 04:03:30,082 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1017378.0, ans=0.2 2023-06-24 04:03:36,128 INFO [train.py:996] (0/4) Epoch 6, batch 17100, loss[loss=0.2409, simple_loss=0.3152, pruned_loss=0.08329, over 21733.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3089, pruned_loss=0.08156, over 4284391.50 frames. ], batch size: 112, lr: 5.07e-03, grad_scale: 16.0 2023-06-24 04:04:03,068 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 04:05:01,741 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1017618.0, ans=0.125 2023-06-24 04:05:23,563 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.25 vs. limit=15.0 2023-06-24 04:05:23,952 INFO [train.py:996] (0/4) Epoch 6, batch 17150, loss[loss=0.1918, simple_loss=0.274, pruned_loss=0.05476, over 21793.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3039, pruned_loss=0.08058, over 4285787.34 frames. 
], batch size: 332, lr: 5.07e-03, grad_scale: 16.0 2023-06-24 04:05:38,685 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1017738.0, ans=0.125 2023-06-24 04:05:44,968 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.872e+02 2.638e+02 2.899e+02 3.354e+02 4.965e+02, threshold=5.799e+02, percent-clipped=0.0 2023-06-24 04:06:04,855 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1017798.0, ans=0.125 2023-06-24 04:06:23,387 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1017858.0, ans=0.2 2023-06-24 04:06:34,145 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1017858.0, ans=0.125 2023-06-24 04:06:43,074 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1017918.0, ans=0.0 2023-06-24 04:07:07,904 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1017978.0, ans=0.125 2023-06-24 04:07:08,085 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1017978.0, ans=0.125 2023-06-24 04:07:17,732 INFO [train.py:996] (0/4) Epoch 6, batch 17200, loss[loss=0.2463, simple_loss=0.3146, pruned_loss=0.08901, over 21530.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3028, pruned_loss=0.07996, over 4291413.23 frames. ], batch size: 211, lr: 5.07e-03, grad_scale: 32.0 2023-06-24 04:07:24,287 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1018038.0, ans=0.0 2023-06-24 04:08:13,331 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1018158.0, ans=0.05 2023-06-24 04:08:33,369 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.42 vs. limit=15.0 2023-06-24 04:08:34,519 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1018218.0, ans=0.0 2023-06-24 04:09:12,568 INFO [train.py:996] (0/4) Epoch 6, batch 17250, loss[loss=0.2304, simple_loss=0.3018, pruned_loss=0.07947, over 21707.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3072, pruned_loss=0.0821, over 4284736.94 frames. ], batch size: 298, lr: 5.07e-03, grad_scale: 16.0 2023-06-24 04:09:17,363 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1018338.0, ans=0.0 2023-06-24 04:09:25,167 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.067e+02 2.699e+02 3.105e+02 3.621e+02 5.993e+02, threshold=6.210e+02, percent-clipped=1.0 2023-06-24 04:11:01,926 INFO [train.py:996] (0/4) Epoch 6, batch 17300, loss[loss=0.249, simple_loss=0.3251, pruned_loss=0.08648, over 21771.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3171, pruned_loss=0.08571, over 4282005.37 frames. 
], batch size: 332, lr: 5.07e-03, grad_scale: 16.0 2023-06-24 04:12:21,780 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1018818.0, ans=0.125 2023-06-24 04:12:43,239 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1018878.0, ans=0.0 2023-06-24 04:12:53,389 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1018878.0, ans=0.0 2023-06-24 04:12:58,390 INFO [train.py:996] (0/4) Epoch 6, batch 17350, loss[loss=0.198, simple_loss=0.2358, pruned_loss=0.08008, over 19972.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3169, pruned_loss=0.0846, over 4274108.53 frames. ], batch size: 703, lr: 5.07e-03, grad_scale: 16.0 2023-06-24 04:13:16,102 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.921e+02 2.820e+02 3.152e+02 3.644e+02 6.101e+02, threshold=6.303e+02, percent-clipped=0.0 2023-06-24 04:13:18,517 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1018938.0, ans=0.125 2023-06-24 04:13:36,809 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.12 vs. limit=15.0 2023-06-24 04:14:28,544 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1019178.0, ans=0.025 2023-06-24 04:14:40,842 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1019178.0, ans=0.04949747468305833 2023-06-24 04:14:54,377 INFO [train.py:996] (0/4) Epoch 6, batch 17400, loss[loss=0.2082, simple_loss=0.2905, pruned_loss=0.06295, over 21765.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3124, pruned_loss=0.0806, over 4272033.06 frames. ], batch size: 282, lr: 5.07e-03, grad_scale: 16.0 2023-06-24 04:15:11,142 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1019298.0, ans=0.0 2023-06-24 04:15:21,656 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1019298.0, ans=0.1 2023-06-24 04:15:21,719 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1019298.0, ans=0.1 2023-06-24 04:15:51,389 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1019358.0, ans=0.0 2023-06-24 04:15:53,224 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 04:16:32,431 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1019478.0, ans=0.125 2023-06-24 04:16:44,107 INFO [train.py:996] (0/4) Epoch 6, batch 17450, loss[loss=0.1989, simple_loss=0.3009, pruned_loss=0.04849, over 21568.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3085, pruned_loss=0.07837, over 4273437.95 frames. 
], batch size: 389, lr: 5.07e-03, grad_scale: 16.0 2023-06-24 04:16:44,597 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 04:16:58,154 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.752e+02 2.361e+02 2.755e+02 3.366e+02 5.958e+02, threshold=5.511e+02, percent-clipped=0.0 2023-06-24 04:16:58,854 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1019538.0, ans=0.125 2023-06-24 04:17:21,669 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.26 vs. limit=12.0 2023-06-24 04:17:32,767 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1019658.0, ans=0.125 2023-06-24 04:18:17,068 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1019778.0, ans=0.125 2023-06-24 04:18:21,176 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.36 vs. limit=15.0 2023-06-24 04:18:26,087 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1019778.0, ans=0.125 2023-06-24 04:18:30,655 INFO [train.py:996] (0/4) Epoch 6, batch 17500, loss[loss=0.1938, simple_loss=0.2691, pruned_loss=0.05924, over 21649.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3048, pruned_loss=0.0761, over 4281846.59 frames. ], batch size: 230, lr: 5.06e-03, grad_scale: 8.0 2023-06-24 04:18:38,431 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1019838.0, ans=0.125 2023-06-24 04:18:54,708 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.51 vs. limit=12.0 2023-06-24 04:19:29,325 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1019958.0, ans=0.125 2023-06-24 04:19:38,395 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.48 vs. limit=12.0 2023-06-24 04:19:41,476 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1020018.0, ans=0.0 2023-06-24 04:20:15,388 INFO [train.py:996] (0/4) Epoch 6, batch 17550, loss[loss=0.2877, simple_loss=0.3564, pruned_loss=0.1095, over 21450.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3054, pruned_loss=0.07503, over 4279666.60 frames. 
], batch size: 507, lr: 5.06e-03, grad_scale: 8.0 2023-06-24 04:20:17,746 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1020138.0, ans=0.125 2023-06-24 04:20:27,817 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1020138.0, ans=0.125 2023-06-24 04:20:28,780 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.730e+02 2.219e+02 2.535e+02 2.795e+02 4.245e+02, threshold=5.070e+02, percent-clipped=0.0 2023-06-24 04:20:54,638 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1020258.0, ans=0.125 2023-06-24 04:21:38,440 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1020318.0, ans=0.125 2023-06-24 04:21:40,122 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1020378.0, ans=0.125 2023-06-24 04:21:58,694 INFO [train.py:996] (0/4) Epoch 6, batch 17600, loss[loss=0.2545, simple_loss=0.3311, pruned_loss=0.08897, over 21595.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3085, pruned_loss=0.07549, over 4266612.58 frames. ], batch size: 389, lr: 5.06e-03, grad_scale: 16.0 2023-06-24 04:23:10,048 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1020618.0, ans=0.125 2023-06-24 04:23:31,711 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.19 vs. limit=10.0 2023-06-24 04:23:38,921 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1020678.0, ans=0.125 2023-06-24 04:23:48,337 INFO [train.py:996] (0/4) Epoch 6, batch 17650, loss[loss=0.2391, simple_loss=0.3123, pruned_loss=0.08299, over 21344.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3058, pruned_loss=0.07566, over 4267546.64 frames. ], batch size: 549, lr: 5.06e-03, grad_scale: 16.0 2023-06-24 04:23:49,777 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.67 vs. limit=6.0 2023-06-24 04:24:13,252 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.840e+02 2.481e+02 3.096e+02 4.210e+02 8.151e+02, threshold=6.192e+02, percent-clipped=13.0 2023-06-24 04:25:05,935 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1020918.0, ans=0.0 2023-06-24 04:25:13,127 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1020918.0, ans=0.125 2023-06-24 04:25:19,901 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1020978.0, ans=0.2 2023-06-24 04:25:42,586 INFO [train.py:996] (0/4) Epoch 6, batch 17700, loss[loss=0.2274, simple_loss=0.3013, pruned_loss=0.0768, over 20043.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.3005, pruned_loss=0.07337, over 4268071.12 frames. 
], batch size: 702, lr: 5.06e-03, grad_scale: 16.0 2023-06-24 04:25:52,215 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1021038.0, ans=0.05 2023-06-24 04:26:37,516 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1021158.0, ans=0.2 2023-06-24 04:26:42,584 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1021158.0, ans=0.125 2023-06-24 04:27:02,022 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1021278.0, ans=0.04949747468305833 2023-06-24 04:27:30,954 INFO [train.py:996] (0/4) Epoch 6, batch 17750, loss[loss=0.3067, simple_loss=0.3641, pruned_loss=0.1246, over 21433.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3079, pruned_loss=0.07651, over 4271753.90 frames. ], batch size: 471, lr: 5.06e-03, grad_scale: 16.0 2023-06-24 04:27:44,842 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.800e+02 2.598e+02 3.053e+02 3.567e+02 5.587e+02, threshold=6.107e+02, percent-clipped=0.0 2023-06-24 04:28:12,060 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1021398.0, ans=0.125 2023-06-24 04:29:20,588 INFO [train.py:996] (0/4) Epoch 6, batch 17800, loss[loss=0.1974, simple_loss=0.2809, pruned_loss=0.05696, over 21470.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3073, pruned_loss=0.07579, over 4265822.20 frames. ], batch size: 211, lr: 5.06e-03, grad_scale: 16.0 2023-06-24 04:30:18,724 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1021758.0, ans=0.5 2023-06-24 04:31:20,133 INFO [train.py:996] (0/4) Epoch 6, batch 17850, loss[loss=0.2333, simple_loss=0.2945, pruned_loss=0.08605, over 20023.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3073, pruned_loss=0.07636, over 4262080.68 frames. ], batch size: 702, lr: 5.06e-03, grad_scale: 8.0 2023-06-24 04:31:34,769 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1021938.0, ans=0.035 2023-06-24 04:31:34,885 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1021938.0, ans=0.125 2023-06-24 04:31:35,790 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.909e+02 2.586e+02 3.040e+02 3.727e+02 6.886e+02, threshold=6.079e+02, percent-clipped=3.0 2023-06-24 04:32:04,096 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.66 vs. limit=15.0 2023-06-24 04:32:33,965 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1022118.0, ans=0.0 2023-06-24 04:33:07,922 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1022178.0, ans=0.125 2023-06-24 04:33:10,780 INFO [train.py:996] (0/4) Epoch 6, batch 17900, loss[loss=0.2262, simple_loss=0.3247, pruned_loss=0.06386, over 21850.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3134, pruned_loss=0.07919, over 4265713.20 frames. 
], batch size: 282, lr: 5.06e-03, grad_scale: 8.0 2023-06-24 04:33:57,528 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1022358.0, ans=0.2 2023-06-24 04:34:09,419 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1022358.0, ans=10.0 2023-06-24 04:34:58,983 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.18 vs. limit=15.0 2023-06-24 04:35:01,053 INFO [train.py:996] (0/4) Epoch 6, batch 17950, loss[loss=0.183, simple_loss=0.2772, pruned_loss=0.04445, over 21623.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3121, pruned_loss=0.07557, over 4259842.70 frames. ], batch size: 263, lr: 5.06e-03, grad_scale: 8.0 2023-06-24 04:35:09,266 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.01 vs. limit=10.0 2023-06-24 04:35:16,331 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.647e+02 2.348e+02 2.616e+02 3.044e+02 5.736e+02, threshold=5.233e+02, percent-clipped=0.0 2023-06-24 04:35:27,424 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1022598.0, ans=0.125 2023-06-24 04:36:30,267 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.78 vs. limit=6.0 2023-06-24 04:36:43,956 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.96 vs. limit=5.0 2023-06-24 04:36:47,696 INFO [train.py:996] (0/4) Epoch 6, batch 18000, loss[loss=0.181, simple_loss=0.2357, pruned_loss=0.06312, over 20689.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.3048, pruned_loss=0.07427, over 4257722.63 frames. ], batch size: 607, lr: 5.06e-03, grad_scale: 16.0 2023-06-24 04:36:47,698 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-24 04:37:05,820 INFO [train.py:1028] (0/4) Epoch 6, validation: loss=0.2648, simple_loss=0.3617, pruned_loss=0.08394, over 1796401.00 frames. 2023-06-24 04:37:05,821 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23616MB 2023-06-24 04:37:46,148 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.74 vs. limit=12.0 2023-06-24 04:38:51,132 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1023078.0, ans=0.125 2023-06-24 04:38:53,348 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.31 vs. limit=15.0 2023-06-24 04:38:55,686 INFO [train.py:996] (0/4) Epoch 6, batch 18050, loss[loss=0.225, simple_loss=0.2977, pruned_loss=0.07619, over 21743.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2982, pruned_loss=0.07271, over 4259141.92 frames. ], batch size: 124, lr: 5.06e-03, grad_scale: 16.0 2023-06-24 04:39:22,857 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.791e+02 2.462e+02 2.761e+02 3.558e+02 5.314e+02, threshold=5.521e+02, percent-clipped=1.0 2023-06-24 04:39:24,117 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.38 vs. 
limit=10.0 2023-06-24 04:40:06,363 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1023258.0, ans=0.0 2023-06-24 04:40:09,826 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1023318.0, ans=0.125 2023-06-24 04:40:15,184 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1023318.0, ans=0.0 2023-06-24 04:40:19,482 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.95 vs. limit=6.0 2023-06-24 04:40:27,802 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1023378.0, ans=0.125 2023-06-24 04:40:46,678 INFO [train.py:996] (0/4) Epoch 6, batch 18100, loss[loss=0.2448, simple_loss=0.3365, pruned_loss=0.07653, over 21667.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3047, pruned_loss=0.07548, over 4267918.89 frames. ], batch size: 414, lr: 5.06e-03, grad_scale: 16.0 2023-06-24 04:40:59,105 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1023438.0, ans=0.025 2023-06-24 04:42:05,162 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.09 vs. limit=6.0 2023-06-24 04:42:23,119 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1023678.0, ans=0.1 2023-06-24 04:42:30,720 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.95 vs. limit=22.5 2023-06-24 04:42:42,385 INFO [train.py:996] (0/4) Epoch 6, batch 18150, loss[loss=0.2184, simple_loss=0.2884, pruned_loss=0.07425, over 21286.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.3046, pruned_loss=0.07486, over 4251668.27 frames. ], batch size: 144, lr: 5.05e-03, grad_scale: 16.0 2023-06-24 04:43:02,497 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.917e+02 2.411e+02 2.816e+02 3.524e+02 6.086e+02, threshold=5.632e+02, percent-clipped=3.0 2023-06-24 04:43:03,319 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1023798.0, ans=0.125 2023-06-24 04:43:19,032 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1023798.0, ans=0.1 2023-06-24 04:44:18,546 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1023978.0, ans=0.125 2023-06-24 04:44:22,764 INFO [train.py:996] (0/4) Epoch 6, batch 18200, loss[loss=0.2083, simple_loss=0.2793, pruned_loss=0.06861, over 21737.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.2989, pruned_loss=0.07471, over 4260989.60 frames. 
], batch size: 333, lr: 5.05e-03, grad_scale: 16.0 2023-06-24 04:45:14,148 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1024158.0, ans=0.0 2023-06-24 04:45:15,718 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1024158.0, ans=0.125 2023-06-24 04:45:15,729 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1024158.0, ans=0.125 2023-06-24 04:45:41,322 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1024218.0, ans=0.0 2023-06-24 04:45:42,001 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.18 vs. limit=15.0 2023-06-24 04:45:49,806 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1024278.0, ans=0.125 2023-06-24 04:46:01,719 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.35 vs. limit=22.5 2023-06-24 04:46:07,544 INFO [train.py:996] (0/4) Epoch 6, batch 18250, loss[loss=0.2136, simple_loss=0.2878, pruned_loss=0.06973, over 21857.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2913, pruned_loss=0.0721, over 4267410.71 frames. ], batch size: 371, lr: 5.05e-03, grad_scale: 16.0 2023-06-24 04:46:18,785 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1024338.0, ans=0.125 2023-06-24 04:46:23,380 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.690e+02 2.233e+02 2.540e+02 3.083e+02 5.311e+02, threshold=5.080e+02, percent-clipped=0.0 2023-06-24 04:46:51,296 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1024398.0, ans=0.125 2023-06-24 04:47:10,968 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1024458.0, ans=0.125 2023-06-24 04:47:36,585 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1024518.0, ans=0.125 2023-06-24 04:47:56,492 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1024638.0, ans=0.1 2023-06-24 04:47:57,606 INFO [train.py:996] (0/4) Epoch 6, batch 18300, loss[loss=0.2597, simple_loss=0.3494, pruned_loss=0.08501, over 21783.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.292, pruned_loss=0.07208, over 4263474.19 frames. 
], batch size: 414, lr: 5.05e-03, grad_scale: 16.0 2023-06-24 04:48:06,797 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1024638.0, ans=0.1 2023-06-24 04:48:18,381 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1024698.0, ans=0.125 2023-06-24 04:48:35,976 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1024698.0, ans=0.0 2023-06-24 04:49:05,585 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1024818.0, ans=10.0 2023-06-24 04:49:39,604 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1024878.0, ans=0.0 2023-06-24 04:49:43,578 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1024938.0, ans=0.125 2023-06-24 04:49:44,715 INFO [train.py:996] (0/4) Epoch 6, batch 18350, loss[loss=0.2169, simple_loss=0.2882, pruned_loss=0.07287, over 21752.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.2966, pruned_loss=0.0716, over 4238874.82 frames. ], batch size: 371, lr: 5.05e-03, grad_scale: 16.0 2023-06-24 04:50:00,366 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.116e+02 2.650e+02 3.163e+02 4.128e+02 7.474e+02, threshold=6.326e+02, percent-clipped=9.0 2023-06-24 04:50:50,451 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1025058.0, ans=0.125 2023-06-24 04:51:34,330 INFO [train.py:996] (0/4) Epoch 6, batch 18400, loss[loss=0.1941, simple_loss=0.2642, pruned_loss=0.06202, over 21713.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.292, pruned_loss=0.07006, over 4233753.62 frames. ], batch size: 112, lr: 5.05e-03, grad_scale: 32.0 2023-06-24 04:51:47,258 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1025238.0, ans=0.125 2023-06-24 04:52:22,298 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.14 vs. limit=15.0 2023-06-24 04:53:17,767 INFO [train.py:996] (0/4) Epoch 6, batch 18450, loss[loss=0.1813, simple_loss=0.2507, pruned_loss=0.05595, over 15964.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2883, pruned_loss=0.06673, over 4234627.39 frames. ], batch size: 60, lr: 5.05e-03, grad_scale: 32.0 2023-06-24 04:53:22,252 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1025538.0, ans=0.125 2023-06-24 04:53:33,305 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.665e+02 2.125e+02 2.326e+02 2.659e+02 4.995e+02, threshold=4.653e+02, percent-clipped=0.0 2023-06-24 04:53:57,534 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1025598.0, ans=0.125 2023-06-24 04:54:09,013 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.37 vs. 
limit=15.0 2023-06-24 04:54:15,520 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1025658.0, ans=0.125 2023-06-24 04:54:37,670 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1025718.0, ans=0.0 2023-06-24 04:54:58,604 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1025778.0, ans=0.125 2023-06-24 04:55:01,819 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1025778.0, ans=0.1 2023-06-24 04:55:06,516 INFO [train.py:996] (0/4) Epoch 6, batch 18500, loss[loss=0.1998, simple_loss=0.2718, pruned_loss=0.06388, over 21985.00 frames. ], tot_loss[loss=0.208, simple_loss=0.2838, pruned_loss=0.0661, over 4245261.29 frames. ], batch size: 103, lr: 5.05e-03, grad_scale: 32.0 2023-06-24 04:55:14,002 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1025838.0, ans=0.2 2023-06-24 04:55:21,725 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.83 vs. limit=15.0 2023-06-24 04:56:20,040 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.04 vs. limit=15.0 2023-06-24 04:56:52,802 INFO [train.py:996] (0/4) Epoch 6, batch 18550, loss[loss=0.2112, simple_loss=0.2741, pruned_loss=0.0741, over 21736.00 frames. ], tot_loss[loss=0.2062, simple_loss=0.2815, pruned_loss=0.06547, over 4243551.75 frames. ], batch size: 316, lr: 5.05e-03, grad_scale: 32.0 2023-06-24 04:56:53,574 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1026138.0, ans=0.1 2023-06-24 04:57:07,738 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1026138.0, ans=0.0 2023-06-24 04:57:10,441 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.895e+02 2.481e+02 2.781e+02 3.235e+02 5.250e+02, threshold=5.562e+02, percent-clipped=2.0 2023-06-24 04:57:49,426 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1026258.0, ans=0.04949747468305833 2023-06-24 04:57:53,128 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1026258.0, ans=0.125 2023-06-24 04:58:02,116 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1026318.0, ans=0.1 2023-06-24 04:58:19,632 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1026318.0, ans=0.125 2023-06-24 04:58:23,215 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1026378.0, ans=0.07 2023-06-24 04:58:31,589 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1026378.0, ans=0.0 2023-06-24 04:58:37,057 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1026378.0, ans=0.125 2023-06-24 04:58:41,456 INFO [train.py:996] (0/4) Epoch 6, batch 18600, loss[loss=0.207, simple_loss=0.272, pruned_loss=0.07097, 
over 15811.00 frames. ], tot_loss[loss=0.2066, simple_loss=0.2812, pruned_loss=0.06597, over 4241494.53 frames. ], batch size: 63, lr: 5.05e-03, grad_scale: 16.0 2023-06-24 04:59:39,586 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.97 vs. limit=15.0 2023-06-24 05:00:08,632 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn1.whiten.whitening_limit, batch_count=1026618.0, ans=22.5 2023-06-24 05:00:29,917 INFO [train.py:996] (0/4) Epoch 6, batch 18650, loss[loss=0.1939, simple_loss=0.2668, pruned_loss=0.06049, over 21733.00 frames. ], tot_loss[loss=0.2071, simple_loss=0.2815, pruned_loss=0.06637, over 4251696.93 frames. ], batch size: 124, lr: 5.05e-03, grad_scale: 16.0 2023-06-24 05:00:45,552 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1026798.0, ans=0.125 2023-06-24 05:00:46,764 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.811e+02 2.410e+02 2.665e+02 3.233e+02 6.336e+02, threshold=5.330e+02, percent-clipped=1.0 2023-06-24 05:01:12,623 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1026798.0, ans=0.0 2023-06-24 05:01:27,773 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1026858.0, ans=0.1 2023-06-24 05:01:58,037 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1026978.0, ans=0.125 2023-06-24 05:02:16,614 INFO [train.py:996] (0/4) Epoch 6, batch 18700, loss[loss=0.1949, simple_loss=0.2572, pruned_loss=0.0663, over 21597.00 frames. ], tot_loss[loss=0.2077, simple_loss=0.28, pruned_loss=0.06777, over 4254154.37 frames. ], batch size: 230, lr: 5.05e-03, grad_scale: 16.0 2023-06-24 05:02:28,750 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1027038.0, ans=0.1 2023-06-24 05:02:35,505 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1027098.0, ans=0.2 2023-06-24 05:02:37,207 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1027098.0, ans=0.1 2023-06-24 05:02:44,383 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1027098.0, ans=0.1 2023-06-24 05:03:04,472 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1027158.0, ans=0.125 2023-06-24 05:04:03,121 INFO [train.py:996] (0/4) Epoch 6, batch 18750, loss[loss=0.2107, simple_loss=0.2745, pruned_loss=0.07347, over 21210.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2821, pruned_loss=0.07005, over 4264252.59 frames. ], batch size: 608, lr: 5.05e-03, grad_scale: 8.0 2023-06-24 05:04:22,090 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.075e+02 2.422e+02 2.735e+02 3.202e+02 4.733e+02, threshold=5.471e+02, percent-clipped=0.0 2023-06-24 05:04:26,753 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.89 vs. 
limit=15.0 2023-06-24 05:05:35,776 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1027578.0, ans=0.125 2023-06-24 05:05:37,254 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1027578.0, ans=0.125 2023-06-24 05:05:50,789 INFO [train.py:996] (0/4) Epoch 6, batch 18800, loss[loss=0.1931, simple_loss=0.2748, pruned_loss=0.05577, over 21360.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2881, pruned_loss=0.07103, over 4271344.35 frames. ], batch size: 211, lr: 5.05e-03, grad_scale: 16.0 2023-06-24 05:05:56,892 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.52 vs. limit=22.5 2023-06-24 05:06:23,962 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1027698.0, ans=0.125 2023-06-24 05:06:35,408 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.23 vs. limit=6.0 2023-06-24 05:06:51,611 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.97 vs. limit=15.0 2023-06-24 05:07:38,456 INFO [train.py:996] (0/4) Epoch 6, batch 18850, loss[loss=0.1935, simple_loss=0.2712, pruned_loss=0.05795, over 21691.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2855, pruned_loss=0.0675, over 4269422.75 frames. ], batch size: 298, lr: 5.04e-03, grad_scale: 16.0 2023-06-24 05:07:57,084 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.497e+02 2.192e+02 2.570e+02 2.921e+02 4.536e+02, threshold=5.140e+02, percent-clipped=0.0 2023-06-24 05:08:55,728 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1028118.0, ans=0.125 2023-06-24 05:08:57,591 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1028118.0, ans=0.125 2023-06-24 05:09:26,034 INFO [train.py:996] (0/4) Epoch 6, batch 18900, loss[loss=0.2248, simple_loss=0.2895, pruned_loss=0.0801, over 21710.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2827, pruned_loss=0.06771, over 4268146.42 frames. ], batch size: 391, lr: 5.04e-03, grad_scale: 16.0 2023-06-24 05:09:26,900 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1028238.0, ans=0.125 2023-06-24 05:09:39,524 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.33 vs. limit=15.0 2023-06-24 05:10:13,677 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1028358.0, ans=0.125 2023-06-24 05:10:29,881 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1028358.0, ans=0.1 2023-06-24 05:11:14,532 INFO [train.py:996] (0/4) Epoch 6, batch 18950, loss[loss=0.2452, simple_loss=0.3084, pruned_loss=0.09101, over 21747.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2841, pruned_loss=0.06987, over 4279980.94 frames. 
], batch size: 441, lr: 5.04e-03, grad_scale: 16.0 2023-06-24 05:11:18,954 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1028538.0, ans=0.05 2023-06-24 05:11:39,579 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.167e+02 2.683e+02 3.004e+02 3.629e+02 6.368e+02, threshold=6.008e+02, percent-clipped=2.0 2023-06-24 05:11:48,029 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1028598.0, ans=0.125 2023-06-24 05:12:06,198 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1028658.0, ans=0.0 2023-06-24 05:12:39,724 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1028718.0, ans=0.09899494936611666 2023-06-24 05:12:43,144 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1028718.0, ans=0.0 2023-06-24 05:12:51,483 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1028778.0, ans=0.1 2023-06-24 05:13:05,445 INFO [train.py:996] (0/4) Epoch 6, batch 19000, loss[loss=0.2285, simple_loss=0.2813, pruned_loss=0.0879, over 21848.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2951, pruned_loss=0.07277, over 4283207.79 frames. ], batch size: 98, lr: 5.04e-03, grad_scale: 16.0 2023-06-24 05:14:36,963 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1029078.0, ans=0.125 2023-06-24 05:14:54,022 INFO [train.py:996] (0/4) Epoch 6, batch 19050, loss[loss=0.2447, simple_loss=0.3073, pruned_loss=0.09105, over 21205.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.3001, pruned_loss=0.07742, over 4285907.97 frames. ], batch size: 143, lr: 5.04e-03, grad_scale: 16.0 2023-06-24 05:15:13,470 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.47 vs. limit=22.5 2023-06-24 05:15:19,302 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.099e+02 2.839e+02 3.291e+02 3.950e+02 6.159e+02, threshold=6.582e+02, percent-clipped=1.0 2023-06-24 05:15:53,278 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1029258.0, ans=0.125 2023-06-24 05:16:38,778 INFO [train.py:996] (0/4) Epoch 6, batch 19100, loss[loss=0.2068, simple_loss=0.2667, pruned_loss=0.07347, over 21251.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.2979, pruned_loss=0.07827, over 4279863.92 frames. ], batch size: 548, lr: 5.04e-03, grad_scale: 16.0 2023-06-24 05:18:36,318 INFO [train.py:996] (0/4) Epoch 6, batch 19150, loss[loss=0.2806, simple_loss=0.3737, pruned_loss=0.09378, over 21657.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3001, pruned_loss=0.07882, over 4282118.21 frames. ], batch size: 441, lr: 5.04e-03, grad_scale: 16.0 2023-06-24 05:18:44,458 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.49 vs. 
limit=22.5 2023-06-24 05:18:45,645 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1029738.0, ans=0.0 2023-06-24 05:19:12,555 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.104e+02 2.501e+02 2.737e+02 3.196e+02 5.229e+02, threshold=5.475e+02, percent-clipped=0.0 2023-06-24 05:19:46,164 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.51 vs. limit=15.0 2023-06-24 05:19:59,602 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 05:20:12,944 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.28 vs. limit=15.0 2023-06-24 05:20:15,591 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1029978.0, ans=0.1 2023-06-24 05:20:17,147 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1029978.0, ans=0.0 2023-06-24 05:20:37,954 INFO [train.py:996] (0/4) Epoch 6, batch 19200, loss[loss=0.2631, simple_loss=0.3869, pruned_loss=0.06968, over 20690.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3109, pruned_loss=0.07925, over 4285944.04 frames. ], batch size: 607, lr: 5.04e-03, grad_scale: 16.0 2023-06-24 05:20:45,726 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1030038.0, ans=0.0 2023-06-24 05:21:22,176 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.57 vs. limit=12.0 2023-06-24 05:21:26,857 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1030158.0, ans=0.125 2023-06-24 05:22:02,481 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1030278.0, ans=0.125 2023-06-24 05:22:19,336 INFO [train.py:996] (0/4) Epoch 6, batch 19250, loss[loss=0.2287, simple_loss=0.3084, pruned_loss=0.07454, over 21568.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3092, pruned_loss=0.0738, over 4289815.65 frames. ], batch size: 471, lr: 5.04e-03, grad_scale: 16.0 2023-06-24 05:22:49,479 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1030398.0, ans=0.125 2023-06-24 05:22:50,494 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.504e+02 2.125e+02 2.467e+02 2.912e+02 4.275e+02, threshold=4.933e+02, percent-clipped=0.0 2023-06-24 05:23:59,431 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.48 vs. limit=15.0 2023-06-24 05:24:12,252 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.86 vs. limit=12.0 2023-06-24 05:24:12,812 INFO [train.py:996] (0/4) Epoch 6, batch 19300, loss[loss=0.2432, simple_loss=0.3172, pruned_loss=0.08464, over 21567.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.306, pruned_loss=0.07336, over 4289339.45 frames. 
], batch size: 471, lr: 5.04e-03, grad_scale: 16.0 2023-06-24 05:24:42,463 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.92 vs. limit=22.5 2023-06-24 05:24:50,199 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1030698.0, ans=0.125 2023-06-24 05:25:09,884 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1030818.0, ans=0.0 2023-06-24 05:25:57,833 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1030878.0, ans=0.125 2023-06-24 05:25:57,867 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1030878.0, ans=0.0 2023-06-24 05:26:02,597 INFO [train.py:996] (0/4) Epoch 6, batch 19350, loss[loss=0.2265, simple_loss=0.313, pruned_loss=0.07, over 21704.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.301, pruned_loss=0.07, over 4290852.67 frames. ], batch size: 391, lr: 5.04e-03, grad_scale: 16.0 2023-06-24 05:26:03,333 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1030938.0, ans=0.0 2023-06-24 05:26:28,594 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.743e+02 2.277e+02 2.629e+02 3.333e+02 6.338e+02, threshold=5.259e+02, percent-clipped=7.0 2023-06-24 05:27:00,930 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.15 vs. limit=10.0 2023-06-24 05:27:10,505 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1031118.0, ans=0.125 2023-06-24 05:27:22,943 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1031178.0, ans=0.125 2023-06-24 05:27:50,213 INFO [train.py:996] (0/4) Epoch 6, batch 19400, loss[loss=0.1961, simple_loss=0.263, pruned_loss=0.06466, over 21674.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2992, pruned_loss=0.06943, over 4291923.27 frames. ], batch size: 230, lr: 5.04e-03, grad_scale: 16.0 2023-06-24 05:28:14,053 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=1031298.0, ans=10.0 2023-06-24 05:28:45,988 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1031358.0, ans=0.0 2023-06-24 05:28:58,285 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1031418.0, ans=0.1 2023-06-24 05:29:24,678 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1031478.0, ans=0.125 2023-06-24 05:29:37,934 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1031478.0, ans=0.0 2023-06-24 05:29:44,562 INFO [train.py:996] (0/4) Epoch 6, batch 19450, loss[loss=0.2375, simple_loss=0.3037, pruned_loss=0.08566, over 21962.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2971, pruned_loss=0.07125, over 4295608.91 frames. 
], batch size: 119, lr: 5.04e-03, grad_scale: 16.0 2023-06-24 05:29:47,558 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.17 vs. limit=22.5 2023-06-24 05:30:05,478 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.971e+02 2.505e+02 2.907e+02 3.403e+02 7.011e+02, threshold=5.814e+02, percent-clipped=3.0 2023-06-24 05:30:50,794 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1031718.0, ans=0.0 2023-06-24 05:31:21,432 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1031778.0, ans=0.125 2023-06-24 05:31:29,137 INFO [train.py:996] (0/4) Epoch 6, batch 19500, loss[loss=0.2813, simple_loss=0.3345, pruned_loss=0.114, over 21384.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2927, pruned_loss=0.07219, over 4295953.87 frames. ], batch size: 507, lr: 5.04e-03, grad_scale: 16.0 2023-06-24 05:32:16,300 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-172000.pt 2023-06-24 05:32:29,172 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1032018.0, ans=0.125 2023-06-24 05:32:30,836 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1032018.0, ans=0.04949747468305833 2023-06-24 05:33:07,787 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 05:33:17,783 INFO [train.py:996] (0/4) Epoch 6, batch 19550, loss[loss=0.2264, simple_loss=0.3117, pruned_loss=0.0705, over 21508.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2898, pruned_loss=0.07137, over 4293227.16 frames. ], batch size: 471, lr: 5.03e-03, grad_scale: 16.0 2023-06-24 05:33:23,472 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1032138.0, ans=0.125 2023-06-24 05:33:37,934 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.061e+02 2.721e+02 3.147e+02 3.714e+02 5.540e+02, threshold=6.293e+02, percent-clipped=0.0 2023-06-24 05:33:56,074 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1032258.0, ans=0.125 2023-06-24 05:33:57,691 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1032258.0, ans=0.09899494936611666 2023-06-24 05:34:52,056 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.66 vs. limit=15.0 2023-06-24 05:35:04,145 INFO [train.py:996] (0/4) Epoch 6, batch 19600, loss[loss=0.2221, simple_loss=0.2845, pruned_loss=0.07984, over 21226.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2908, pruned_loss=0.07181, over 4298927.53 frames. ], batch size: 159, lr: 5.03e-03, grad_scale: 32.0 2023-06-24 05:35:22,798 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.34 vs. limit=15.0 2023-06-24 05:35:28,112 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.67 vs. 
limit=10.0 2023-06-24 05:36:00,747 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1032618.0, ans=0.1 2023-06-24 05:36:06,202 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1032618.0, ans=0.125 2023-06-24 05:36:07,952 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1032618.0, ans=0.0 2023-06-24 05:36:53,286 INFO [train.py:996] (0/4) Epoch 6, batch 19650, loss[loss=0.2351, simple_loss=0.3074, pruned_loss=0.08139, over 21699.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.2962, pruned_loss=0.07555, over 4304590.94 frames. ], batch size: 298, lr: 5.03e-03, grad_scale: 16.0 2023-06-24 05:37:07,948 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1032738.0, ans=0.1 2023-06-24 05:37:09,796 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1032798.0, ans=0.0 2023-06-24 05:37:16,216 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.065e+02 2.599e+02 2.881e+02 3.237e+02 5.731e+02, threshold=5.762e+02, percent-clipped=0.0 2023-06-24 05:37:17,407 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.90 vs. limit=15.0 2023-06-24 05:38:45,140 INFO [train.py:996] (0/4) Epoch 6, batch 19700, loss[loss=0.1917, simple_loss=0.2767, pruned_loss=0.05333, over 21584.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.2988, pruned_loss=0.07598, over 4297515.34 frames. ], batch size: 230, lr: 5.03e-03, grad_scale: 16.0 2023-06-24 05:39:01,624 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.17 vs. limit=22.5 2023-06-24 05:39:12,132 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1033098.0, ans=0.125 2023-06-24 05:39:56,611 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1033158.0, ans=0.125 2023-06-24 05:40:30,993 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1033278.0, ans=0.125 2023-06-24 05:40:35,428 INFO [train.py:996] (0/4) Epoch 6, batch 19750, loss[loss=0.2395, simple_loss=0.3326, pruned_loss=0.07318, over 21617.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3085, pruned_loss=0.07763, over 4293594.35 frames. 
], batch size: 230, lr: 5.03e-03, grad_scale: 8.0 2023-06-24 05:41:09,214 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.111e+02 2.723e+02 3.338e+02 4.190e+02 5.879e+02, threshold=6.676e+02, percent-clipped=1.0 2023-06-24 05:41:20,024 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1033458.0, ans=0.125 2023-06-24 05:41:21,714 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1033458.0, ans=0.2 2023-06-24 05:41:48,604 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1033518.0, ans=0.125 2023-06-24 05:41:53,764 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1033518.0, ans=0.95 2023-06-24 05:42:22,730 INFO [train.py:996] (0/4) Epoch 6, batch 19800, loss[loss=0.2034, simple_loss=0.2844, pruned_loss=0.0612, over 21776.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3066, pruned_loss=0.07743, over 4293906.68 frames. ], batch size: 332, lr: 5.03e-03, grad_scale: 8.0 2023-06-24 05:42:56,766 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1033698.0, ans=0.2 2023-06-24 05:43:37,694 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1033818.0, ans=0.125 2023-06-24 05:43:54,195 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1033878.0, ans=0.0 2023-06-24 05:44:04,703 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1033878.0, ans=0.2 2023-06-24 05:44:17,977 INFO [train.py:996] (0/4) Epoch 6, batch 19850, loss[loss=0.1477, simple_loss=0.1973, pruned_loss=0.04904, over 16321.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.2994, pruned_loss=0.07292, over 4276549.74 frames. ], batch size: 60, lr: 5.03e-03, grad_scale: 8.0 2023-06-24 05:44:18,652 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1033938.0, ans=0.0 2023-06-24 05:44:33,696 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1033938.0, ans=0.125 2023-06-24 05:44:52,061 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.808e+02 2.320e+02 2.642e+02 2.979e+02 5.130e+02, threshold=5.285e+02, percent-clipped=0.0 2023-06-24 05:45:31,871 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1034118.0, ans=0.1 2023-06-24 05:45:43,303 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=1034178.0, ans=0.05 2023-06-24 05:45:59,002 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1034178.0, ans=0.2 2023-06-24 05:46:03,663 INFO [train.py:996] (0/4) Epoch 6, batch 19900, loss[loss=0.2024, simple_loss=0.2938, pruned_loss=0.05552, over 21456.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.301, pruned_loss=0.07057, over 4277430.69 frames. 
], batch size: 211, lr: 5.03e-03, grad_scale: 8.0 2023-06-24 05:46:04,570 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1034238.0, ans=0.1 2023-06-24 05:47:04,345 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1034358.0, ans=0.2 2023-06-24 05:47:30,904 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1034478.0, ans=0.1 2023-06-24 05:47:58,285 INFO [train.py:996] (0/4) Epoch 6, batch 19950, loss[loss=0.1971, simple_loss=0.2794, pruned_loss=0.05736, over 21836.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2957, pruned_loss=0.0704, over 4271689.96 frames. ], batch size: 372, lr: 5.03e-03, grad_scale: 8.0 2023-06-24 05:48:33,577 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.795e+02 2.312e+02 2.767e+02 3.263e+02 6.271e+02, threshold=5.533e+02, percent-clipped=3.0 2023-06-24 05:49:33,515 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1034778.0, ans=0.125 2023-06-24 05:49:46,367 INFO [train.py:996] (0/4) Epoch 6, batch 20000, loss[loss=0.2148, simple_loss=0.2884, pruned_loss=0.07059, over 21356.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2975, pruned_loss=0.07099, over 4265113.13 frames. ], batch size: 144, lr: 5.03e-03, grad_scale: 16.0 2023-06-24 05:49:55,350 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1034838.0, ans=0.1 2023-06-24 05:50:57,080 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.95 vs. limit=15.0 2023-06-24 05:51:33,388 INFO [train.py:996] (0/4) Epoch 6, batch 20050, loss[loss=0.2299, simple_loss=0.3009, pruned_loss=0.07939, over 21882.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2988, pruned_loss=0.07297, over 4275149.97 frames. ], batch size: 118, lr: 5.03e-03, grad_scale: 16.0 2023-06-24 05:52:02,726 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.17 vs. limit=15.0 2023-06-24 05:52:03,805 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1035198.0, ans=0.04949747468305833 2023-06-24 05:52:08,124 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.887e+02 2.657e+02 2.915e+02 3.243e+02 4.793e+02, threshold=5.831e+02, percent-clipped=0.0 2023-06-24 05:52:19,956 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.32 vs. limit=6.0 2023-06-24 05:52:21,338 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1035258.0, ans=0.125 2023-06-24 05:52:24,532 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1035258.0, ans=0.035 2023-06-24 05:52:51,783 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1035318.0, ans=0.1 2023-06-24 05:53:23,694 INFO [train.py:996] (0/4) Epoch 6, batch 20100, loss[loss=0.2056, simple_loss=0.2642, pruned_loss=0.07346, over 21256.00 frames. 
], tot_loss[loss=0.2259, simple_loss=0.3007, pruned_loss=0.07554, over 4274574.67 frames. ], batch size: 608, lr: 5.03e-03, grad_scale: 16.0 2023-06-24 05:53:45,570 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1035438.0, ans=0.125 2023-06-24 05:54:09,431 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.77 vs. limit=22.5 2023-06-24 05:54:30,100 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1035618.0, ans=0.0 2023-06-24 05:55:18,930 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1035738.0, ans=0.0 2023-06-24 05:55:20,095 INFO [train.py:996] (0/4) Epoch 6, batch 20150, loss[loss=0.2658, simple_loss=0.3385, pruned_loss=0.09654, over 21355.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3093, pruned_loss=0.07872, over 4275506.14 frames. ], batch size: 159, lr: 5.03e-03, grad_scale: 16.0 2023-06-24 05:55:22,705 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1035738.0, ans=0.125 2023-06-24 05:55:46,197 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.227e+02 2.880e+02 3.455e+02 4.017e+02 7.640e+02, threshold=6.911e+02, percent-clipped=4.0 2023-06-24 05:56:21,044 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1035858.0, ans=0.0 2023-06-24 05:56:22,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1035858.0, ans=0.0 2023-06-24 05:56:53,841 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1035978.0, ans=0.0 2023-06-24 05:57:12,802 INFO [train.py:996] (0/4) Epoch 6, batch 20200, loss[loss=0.2318, simple_loss=0.3132, pruned_loss=0.07523, over 21765.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3152, pruned_loss=0.08148, over 4271189.68 frames. ], batch size: 282, lr: 5.02e-03, grad_scale: 16.0 2023-06-24 05:57:20,860 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1036038.0, ans=0.0 2023-06-24 05:57:22,510 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1036038.0, ans=0.125 2023-06-24 05:58:52,550 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1036278.0, ans=0.0 2023-06-24 05:58:59,061 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1036278.0, ans=0.2 2023-06-24 05:59:01,885 INFO [train.py:996] (0/4) Epoch 6, batch 20250, loss[loss=0.2095, simple_loss=0.2902, pruned_loss=0.06442, over 21385.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3154, pruned_loss=0.08014, over 4279358.73 frames. 
], batch size: 176, lr: 5.02e-03, grad_scale: 16.0 2023-06-24 05:59:26,781 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.892e+02 2.473e+02 2.856e+02 3.579e+02 8.091e+02, threshold=5.711e+02, percent-clipped=1.0 2023-06-24 05:59:36,607 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1036398.0, ans=0.1 2023-06-24 06:00:49,985 INFO [train.py:996] (0/4) Epoch 6, batch 20300, loss[loss=0.1877, simple_loss=0.2643, pruned_loss=0.05558, over 21909.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3134, pruned_loss=0.07748, over 4282187.53 frames. ], batch size: 98, lr: 5.02e-03, grad_scale: 16.0 2023-06-24 06:00:53,795 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1036638.0, ans=0.125 2023-06-24 06:01:14,470 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1036698.0, ans=0.2 2023-06-24 06:01:23,405 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1036758.0, ans=0.0 2023-06-24 06:02:18,840 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1036878.0, ans=0.1 2023-06-24 06:02:33,236 INFO [train.py:996] (0/4) Epoch 6, batch 20350, loss[loss=0.2502, simple_loss=0.3174, pruned_loss=0.09149, over 21463.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3139, pruned_loss=0.07819, over 4279908.85 frames. ], batch size: 548, lr: 5.02e-03, grad_scale: 16.0 2023-06-24 06:02:49,845 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.12 vs. limit=6.0 2023-06-24 06:02:56,832 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.914e+02 2.320e+02 2.555e+02 2.973e+02 6.061e+02, threshold=5.110e+02, percent-clipped=1.0 2023-06-24 06:04:03,717 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1037178.0, ans=0.125 2023-06-24 06:04:04,245 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.61 vs. limit=15.0 2023-06-24 06:04:20,809 INFO [train.py:996] (0/4) Epoch 6, batch 20400, loss[loss=0.2574, simple_loss=0.3296, pruned_loss=0.09264, over 21852.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3155, pruned_loss=0.08008, over 4271400.61 frames. 
], batch size: 118, lr: 5.02e-03, grad_scale: 32.0 2023-06-24 06:04:21,484 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1037238.0, ans=0.0 2023-06-24 06:04:56,454 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1037298.0, ans=0.0 2023-06-24 06:05:28,978 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1037418.0, ans=0.0 2023-06-24 06:05:41,589 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1037418.0, ans=0.125 2023-06-24 06:05:45,236 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1037418.0, ans=0.1 2023-06-24 06:05:50,217 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1037478.0, ans=0.2 2023-06-24 06:05:57,107 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1037478.0, ans=0.125 2023-06-24 06:06:08,182 INFO [train.py:996] (0/4) Epoch 6, batch 20450, loss[loss=0.2278, simple_loss=0.2961, pruned_loss=0.07972, over 21575.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3161, pruned_loss=0.08184, over 4259690.70 frames. ], batch size: 548, lr: 5.02e-03, grad_scale: 32.0 2023-06-24 06:06:10,645 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1037538.0, ans=0.05 2023-06-24 06:06:31,829 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.242e+02 2.916e+02 3.328e+02 3.687e+02 5.878e+02, threshold=6.655e+02, percent-clipped=5.0 2023-06-24 06:07:54,287 INFO [train.py:996] (0/4) Epoch 6, batch 20500, loss[loss=0.1954, simple_loss=0.268, pruned_loss=0.06136, over 16687.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3114, pruned_loss=0.08166, over 4265679.84 frames. ], batch size: 62, lr: 5.02e-03, grad_scale: 32.0 2023-06-24 06:09:28,437 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1038078.0, ans=0.015 2023-06-24 06:09:32,164 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1038078.0, ans=0.125 2023-06-24 06:09:35,644 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1038078.0, ans=0.2 2023-06-24 06:09:41,879 INFO [train.py:996] (0/4) Epoch 6, batch 20550, loss[loss=0.2959, simple_loss=0.3576, pruned_loss=0.1171, over 21409.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3049, pruned_loss=0.08051, over 4259369.84 frames. 
], batch size: 508, lr: 5.02e-03, grad_scale: 32.0 2023-06-24 06:10:06,165 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.939e+02 2.636e+02 3.017e+02 3.648e+02 5.396e+02, threshold=6.035e+02, percent-clipped=0.0 2023-06-24 06:10:35,994 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1038258.0, ans=0.2 2023-06-24 06:10:41,089 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1038258.0, ans=0.95 2023-06-24 06:10:53,566 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1038318.0, ans=0.025 2023-06-24 06:11:14,616 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1038378.0, ans=0.09899494936611666 2023-06-24 06:11:29,767 INFO [train.py:996] (0/4) Epoch 6, batch 20600, loss[loss=0.2702, simple_loss=0.3359, pruned_loss=0.1022, over 21537.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3096, pruned_loss=0.07909, over 4249521.28 frames. ], batch size: 471, lr: 5.02e-03, grad_scale: 32.0 2023-06-24 06:12:01,714 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 06:12:29,268 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1038558.0, ans=0.0 2023-06-24 06:12:42,465 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1038618.0, ans=0.125 2023-06-24 06:12:47,708 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1038618.0, ans=0.125 2023-06-24 06:13:10,692 INFO [train.py:996] (0/4) Epoch 6, batch 20650, loss[loss=0.2243, simple_loss=0.2843, pruned_loss=0.08214, over 21714.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3035, pruned_loss=0.07843, over 4241116.49 frames. ], batch size: 415, lr: 5.02e-03, grad_scale: 32.0 2023-06-24 06:13:13,165 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1038738.0, ans=0.1 2023-06-24 06:13:13,278 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1038738.0, ans=0.2 2023-06-24 06:13:20,547 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.38 vs. limit=15.0 2023-06-24 06:13:40,720 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.859e+02 2.453e+02 2.852e+02 3.486e+02 6.346e+02, threshold=5.704e+02, percent-clipped=1.0 2023-06-24 06:14:23,654 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1038918.0, ans=0.1 2023-06-24 06:14:29,051 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1038918.0, ans=0.0 2023-06-24 06:14:43,207 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1038978.0, ans=0.0 2023-06-24 06:14:52,714 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.51 vs. 
limit=15.0 2023-06-24 06:14:53,833 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1038978.0, ans=0.07 2023-06-24 06:15:00,373 INFO [train.py:996] (0/4) Epoch 6, batch 20700, loss[loss=0.1737, simple_loss=0.2436, pruned_loss=0.05191, over 20766.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2946, pruned_loss=0.07464, over 4248474.14 frames. ], batch size: 608, lr: 5.02e-03, grad_scale: 16.0 2023-06-24 06:15:27,225 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1039098.0, ans=0.125 2023-06-24 06:15:29,118 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1039098.0, ans=0.0 2023-06-24 06:15:42,930 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1039098.0, ans=0.0 2023-06-24 06:16:49,949 INFO [train.py:996] (0/4) Epoch 6, batch 20750, loss[loss=0.195, simple_loss=0.2663, pruned_loss=0.06185, over 21446.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.2971, pruned_loss=0.07417, over 4257762.97 frames. ], batch size: 211, lr: 5.02e-03, grad_scale: 16.0 2023-06-24 06:17:37,394 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.847e+02 2.434e+02 2.945e+02 4.112e+02 9.661e+02, threshold=5.891e+02, percent-clipped=8.0 2023-06-24 06:17:58,940 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1039458.0, ans=0.0 2023-06-24 06:18:16,033 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 06:18:31,723 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1039578.0, ans=0.1 2023-06-24 06:18:43,219 INFO [train.py:996] (0/4) Epoch 6, batch 20800, loss[loss=0.2222, simple_loss=0.2914, pruned_loss=0.07648, over 21746.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3013, pruned_loss=0.07527, over 4263600.34 frames. ], batch size: 351, lr: 5.02e-03, grad_scale: 32.0 2023-06-24 06:18:45,513 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1039638.0, ans=0.125 2023-06-24 06:18:57,424 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 06:19:55,813 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1039818.0, ans=0.125 2023-06-24 06:19:55,873 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1039818.0, ans=0.125 2023-06-24 06:20:01,002 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1039818.0, ans=0.0 2023-06-24 06:20:02,473 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1039818.0, ans=0.125 2023-06-24 06:20:15,417 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.49 vs. limit=15.0 2023-06-24 06:20:29,171 INFO [train.py:996] (0/4) Epoch 6, batch 20850, loss[loss=0.1846, simple_loss=0.2561, pruned_loss=0.05662, over 21543.00 frames. 
], tot_loss[loss=0.2215, simple_loss=0.2955, pruned_loss=0.07373, over 4265947.95 frames. ], batch size: 212, lr: 5.02e-03, grad_scale: 32.0 2023-06-24 06:20:46,829 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1039938.0, ans=0.0 2023-06-24 06:20:59,839 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1039998.0, ans=0.0 2023-06-24 06:20:59,873 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1039998.0, ans=0.04949747468305833 2023-06-24 06:20:59,919 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1039998.0, ans=10.0 2023-06-24 06:21:06,160 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.891e+02 2.402e+02 2.795e+02 3.449e+02 6.931e+02, threshold=5.589e+02, percent-clipped=4.0 2023-06-24 06:21:28,584 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1040058.0, ans=0.09899494936611666 2023-06-24 06:21:39,452 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1040118.0, ans=0.125 2023-06-24 06:22:18,913 INFO [train.py:996] (0/4) Epoch 6, batch 20900, loss[loss=0.2276, simple_loss=0.3038, pruned_loss=0.07565, over 21548.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.2968, pruned_loss=0.07504, over 4268807.02 frames. ], batch size: 195, lr: 5.01e-03, grad_scale: 32.0 2023-06-24 06:22:22,827 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 06:22:54,043 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 06:23:01,721 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.77 vs. limit=15.0 2023-06-24 06:23:19,879 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1040358.0, ans=0.0 2023-06-24 06:23:43,789 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1040478.0, ans=0.0 2023-06-24 06:24:04,697 INFO [train.py:996] (0/4) Epoch 6, batch 20950, loss[loss=0.1882, simple_loss=0.2615, pruned_loss=0.05744, over 21589.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.292, pruned_loss=0.07171, over 4270313.87 frames. ], batch size: 230, lr: 5.01e-03, grad_scale: 32.0 2023-06-24 06:24:05,081 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1040538.0, ans=0.125 2023-06-24 06:24:10,247 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1040538.0, ans=0.0 2023-06-24 06:24:14,154 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.39 vs. 
limit=22.5 2023-06-24 06:24:25,317 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1040598.0, ans=0.2 2023-06-24 06:24:40,112 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.703e+02 2.258e+02 2.758e+02 3.294e+02 6.843e+02, threshold=5.516e+02, percent-clipped=1.0 2023-06-24 06:25:12,600 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.99 vs. limit=15.0 2023-06-24 06:25:13,580 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1040718.0, ans=0.0 2023-06-24 06:25:20,938 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.36 vs. limit=22.5 2023-06-24 06:25:27,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1040778.0, ans=0.0 2023-06-24 06:25:41,712 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.53 vs. limit=15.0 2023-06-24 06:25:50,770 INFO [train.py:996] (0/4) Epoch 6, batch 21000, loss[loss=0.2061, simple_loss=0.2681, pruned_loss=0.07208, over 21386.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2901, pruned_loss=0.07218, over 4275394.78 frames. ], batch size: 176, lr: 5.01e-03, grad_scale: 32.0 2023-06-24 06:25:50,772 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-24 06:26:08,832 INFO [train.py:1028] (0/4) Epoch 6, validation: loss=0.2672, simple_loss=0.3654, pruned_loss=0.08451, over 1796401.00 frames. 2023-06-24 06:26:08,833 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23616MB 2023-06-24 06:26:49,643 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.72 vs. limit=15.0 2023-06-24 06:27:17,435 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1041018.0, ans=0.125 2023-06-24 06:27:46,152 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1041078.0, ans=0.1 2023-06-24 06:27:50,626 INFO [train.py:996] (0/4) Epoch 6, batch 21050, loss[loss=0.2107, simple_loss=0.2753, pruned_loss=0.07307, over 22004.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2881, pruned_loss=0.07225, over 4268773.19 frames. 
], batch size: 103, lr: 5.01e-03, grad_scale: 16.0 2023-06-24 06:27:57,988 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1041138.0, ans=0.125 2023-06-24 06:28:13,739 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1041198.0, ans=0.125 2023-06-24 06:28:23,279 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.974e+02 2.469e+02 2.621e+02 3.007e+02 4.225e+02, threshold=5.242e+02, percent-clipped=0.0 2023-06-24 06:28:34,713 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1041258.0, ans=0.1 2023-06-24 06:29:02,418 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1041318.0, ans=0.0 2023-06-24 06:29:32,190 INFO [train.py:996] (0/4) Epoch 6, batch 21100, loss[loss=0.2011, simple_loss=0.2661, pruned_loss=0.06801, over 21245.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.285, pruned_loss=0.07234, over 4272183.43 frames. ], batch size: 176, lr: 5.01e-03, grad_scale: 16.0 2023-06-24 06:29:44,082 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.14 vs. limit=6.0 2023-06-24 06:29:53,563 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1041438.0, ans=0.125 2023-06-24 06:29:58,933 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1041498.0, ans=0.0 2023-06-24 06:30:17,089 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1041498.0, ans=0.0 2023-06-24 06:30:34,124 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1041558.0, ans=0.0 2023-06-24 06:31:08,870 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 06:31:10,618 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1041678.0, ans=0.1 2023-06-24 06:31:20,309 INFO [train.py:996] (0/4) Epoch 6, batch 21150, loss[loss=0.2099, simple_loss=0.2762, pruned_loss=0.07182, over 21844.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2816, pruned_loss=0.07237, over 4269435.28 frames. 
], batch size: 107, lr: 5.01e-03, grad_scale: 16.0 2023-06-24 06:31:24,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1041738.0, ans=0.125 2023-06-24 06:32:03,769 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.773e+02 2.519e+02 2.928e+02 4.378e+02 7.241e+02, threshold=5.856e+02, percent-clipped=12.0 2023-06-24 06:32:26,028 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1041918.0, ans=0.2 2023-06-24 06:32:34,581 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1041918.0, ans=0.125 2023-06-24 06:32:36,386 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 06:32:53,360 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1041978.0, ans=0.125 2023-06-24 06:33:01,391 INFO [train.py:996] (0/4) Epoch 6, batch 21200, loss[loss=0.1806, simple_loss=0.2499, pruned_loss=0.05564, over 21463.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2788, pruned_loss=0.0714, over 4270371.84 frames. ], batch size: 131, lr: 5.01e-03, grad_scale: 16.0 2023-06-24 06:34:17,190 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1042218.0, ans=0.125 2023-06-24 06:34:49,533 INFO [train.py:996] (0/4) Epoch 6, batch 21250, loss[loss=0.1957, simple_loss=0.2696, pruned_loss=0.06093, over 21629.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2774, pruned_loss=0.07133, over 4258130.51 frames. ], batch size: 263, lr: 5.01e-03, grad_scale: 16.0 2023-06-24 06:35:33,818 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.037e+02 2.602e+02 2.917e+02 3.308e+02 4.858e+02, threshold=5.834e+02, percent-clipped=0.0 2023-06-24 06:36:13,620 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.74 vs. limit=15.0 2023-06-24 06:36:13,820 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.87 vs. limit=15.0 2023-06-24 06:36:36,716 INFO [train.py:996] (0/4) Epoch 6, batch 21300, loss[loss=0.2141, simple_loss=0.2948, pruned_loss=0.06675, over 21697.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2841, pruned_loss=0.07327, over 4258756.42 frames. ], batch size: 247, lr: 5.01e-03, grad_scale: 16.0 2023-06-24 06:37:35,970 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1042758.0, ans=0.0 2023-06-24 06:37:46,234 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1042758.0, ans=0.125 2023-06-24 06:37:53,752 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.03 vs. limit=6.0 2023-06-24 06:38:28,063 INFO [train.py:996] (0/4) Epoch 6, batch 21350, loss[loss=0.2488, simple_loss=0.3422, pruned_loss=0.07771, over 19822.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2875, pruned_loss=0.07343, over 4265188.61 frames. 
], batch size: 704, lr: 5.01e-03, grad_scale: 16.0 2023-06-24 06:38:59,527 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1042998.0, ans=0.0 2023-06-24 06:39:13,497 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.957e+02 2.459e+02 2.698e+02 3.098e+02 4.551e+02, threshold=5.397e+02, percent-clipped=0.0 2023-06-24 06:39:23,453 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1043058.0, ans=0.125 2023-06-24 06:39:25,564 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.06 vs. limit=15.0 2023-06-24 06:40:27,149 INFO [train.py:996] (0/4) Epoch 6, batch 21400, loss[loss=0.2949, simple_loss=0.357, pruned_loss=0.1164, over 21450.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2909, pruned_loss=0.07358, over 4270836.64 frames. ], batch size: 471, lr: 5.01e-03, grad_scale: 16.0 2023-06-24 06:41:23,254 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.77 vs. limit=15.0 2023-06-24 06:41:28,416 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.74 vs. limit=12.0 2023-06-24 06:41:37,781 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1043418.0, ans=0.125 2023-06-24 06:41:54,885 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1043478.0, ans=0.125 2023-06-24 06:42:15,535 INFO [train.py:996] (0/4) Epoch 6, batch 21450, loss[loss=0.2129, simple_loss=0.2871, pruned_loss=0.06938, over 21873.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2942, pruned_loss=0.07469, over 4276331.72 frames. ], batch size: 371, lr: 5.01e-03, grad_scale: 16.0 2023-06-24 06:42:26,097 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1043538.0, ans=0.1 2023-06-24 06:42:49,430 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.972e+02 2.545e+02 3.012e+02 3.537e+02 6.506e+02, threshold=6.024e+02, percent-clipped=2.0 2023-06-24 06:43:07,569 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1043658.0, ans=0.125 2023-06-24 06:44:02,131 INFO [train.py:996] (0/4) Epoch 6, batch 21500, loss[loss=0.2124, simple_loss=0.2676, pruned_loss=0.07862, over 21241.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.2931, pruned_loss=0.0759, over 4269101.46 frames. ], batch size: 143, lr: 5.01e-03, grad_scale: 16.0 2023-06-24 06:44:17,645 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.39 vs. limit=15.0 2023-06-24 06:45:14,964 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1044018.0, ans=0.125 2023-06-24 06:45:50,207 INFO [train.py:996] (0/4) Epoch 6, batch 21550, loss[loss=0.2067, simple_loss=0.2698, pruned_loss=0.0718, over 21597.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.286, pruned_loss=0.0725, over 4268018.86 frames. 
], batch size: 391, lr: 5.01e-03, grad_scale: 8.0 2023-06-24 06:46:26,729 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.014e+02 2.541e+02 2.913e+02 3.487e+02 5.320e+02, threshold=5.826e+02, percent-clipped=0.0 2023-06-24 06:47:17,264 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.27 vs. limit=22.5 2023-06-24 06:47:39,509 INFO [train.py:996] (0/4) Epoch 6, batch 21600, loss[loss=0.1733, simple_loss=0.2293, pruned_loss=0.0586, over 20749.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2805, pruned_loss=0.07124, over 4272037.56 frames. ], batch size: 608, lr: 5.00e-03, grad_scale: 16.0 2023-06-24 06:48:16,617 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.67 vs. limit=12.0 2023-06-24 06:48:44,248 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1044618.0, ans=0.1 2023-06-24 06:49:26,530 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1044738.0, ans=0.0 2023-06-24 06:49:26,536 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1044738.0, ans=0.1 2023-06-24 06:49:27,627 INFO [train.py:996] (0/4) Epoch 6, batch 21650, loss[loss=0.1887, simple_loss=0.2782, pruned_loss=0.04957, over 21849.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2855, pruned_loss=0.07007, over 4270927.49 frames. ], batch size: 118, lr: 5.00e-03, grad_scale: 16.0 2023-06-24 06:49:34,919 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1044738.0, ans=0.125 2023-06-24 06:49:36,779 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1044738.0, ans=0.035 2023-06-24 06:49:49,066 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1044798.0, ans=0.2 2023-06-24 06:50:03,881 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.981e+02 2.515e+02 2.797e+02 3.244e+02 5.540e+02, threshold=5.595e+02, percent-clipped=0.0 2023-06-24 06:50:52,737 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1044978.0, ans=0.1 2023-06-24 06:51:14,470 INFO [train.py:996] (0/4) Epoch 6, batch 21700, loss[loss=0.1761, simple_loss=0.2567, pruned_loss=0.04775, over 21497.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2859, pruned_loss=0.06813, over 4267593.22 frames. ], batch size: 211, lr: 5.00e-03, grad_scale: 16.0 2023-06-24 06:51:23,684 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1045038.0, ans=0.04949747468305833 2023-06-24 06:51:37,804 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1045098.0, ans=0.0 2023-06-24 06:52:42,060 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1045278.0, ans=0.125 2023-06-24 06:53:01,228 INFO [train.py:996] (0/4) Epoch 6, batch 21750, loss[loss=0.2312, simple_loss=0.2771, pruned_loss=0.09262, over 21211.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2819, pruned_loss=0.06819, over 4256074.97 frames. 
], batch size: 471, lr: 5.00e-03, grad_scale: 16.0 2023-06-24 06:53:06,081 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.51 vs. limit=12.0 2023-06-24 06:53:06,188 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.39 vs. limit=15.0 2023-06-24 06:53:37,706 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.726e+02 2.476e+02 2.744e+02 3.259e+02 4.826e+02, threshold=5.488e+02, percent-clipped=0.0 2023-06-24 06:53:45,382 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1045458.0, ans=0.125 2023-06-24 06:53:52,131 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1045458.0, ans=0.125 2023-06-24 06:53:59,030 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1045518.0, ans=0.2 2023-06-24 06:54:47,133 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1045578.0, ans=0.125 2023-06-24 06:54:49,972 INFO [train.py:996] (0/4) Epoch 6, batch 21800, loss[loss=0.2172, simple_loss=0.2872, pruned_loss=0.0736, over 21483.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2805, pruned_loss=0.06894, over 4255712.48 frames. ], batch size: 212, lr: 5.00e-03, grad_scale: 16.0 2023-06-24 06:55:19,730 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1045698.0, ans=0.0 2023-06-24 06:55:21,472 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1045698.0, ans=0.125 2023-06-24 06:55:33,777 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1045758.0, ans=0.09899494936611666 2023-06-24 06:56:00,041 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1045818.0, ans=0.2 2023-06-24 06:56:01,676 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1045818.0, ans=0.125 2023-06-24 06:56:39,645 INFO [train.py:996] (0/4) Epoch 6, batch 21850, loss[loss=0.2132, simple_loss=0.293, pruned_loss=0.06669, over 21448.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2866, pruned_loss=0.07018, over 4253881.31 frames. ], batch size: 211, lr: 5.00e-03, grad_scale: 16.0 2023-06-24 06:56:40,811 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.59 vs. 
limit=6.0 2023-06-24 06:57:14,014 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1045998.0, ans=0.09899494936611666 2023-06-24 06:57:16,749 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.124e+02 2.510e+02 2.889e+02 3.463e+02 5.314e+02, threshold=5.778e+02, percent-clipped=0.0 2023-06-24 06:57:47,173 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1046118.0, ans=0.125 2023-06-24 06:58:18,389 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1046178.0, ans=0.0 2023-06-24 06:58:21,694 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1046178.0, ans=0.1 2023-06-24 06:58:27,739 INFO [train.py:996] (0/4) Epoch 6, batch 21900, loss[loss=0.1933, simple_loss=0.2606, pruned_loss=0.063, over 21659.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2891, pruned_loss=0.07117, over 4258047.67 frames. ], batch size: 282, lr: 5.00e-03, grad_scale: 16.0 2023-06-24 06:58:35,971 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1046238.0, ans=0.0 2023-06-24 06:59:45,552 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.84 vs. limit=15.0 2023-06-24 07:00:20,085 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.36 vs. limit=8.0 2023-06-24 07:00:21,989 INFO [train.py:996] (0/4) Epoch 6, batch 21950, loss[loss=0.1846, simple_loss=0.253, pruned_loss=0.05806, over 21576.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.2838, pruned_loss=0.07035, over 4261422.65 frames. ], batch size: 298, lr: 5.00e-03, grad_scale: 16.0 2023-06-24 07:00:24,596 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1046538.0, ans=0.125 2023-06-24 07:00:45,400 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1046598.0, ans=0.0 2023-06-24 07:00:53,153 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.690e+02 2.424e+02 2.913e+02 3.468e+02 5.833e+02, threshold=5.826e+02, percent-clipped=1.0 2023-06-24 07:01:47,661 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1046778.0, ans=0.125 2023-06-24 07:02:10,404 INFO [train.py:996] (0/4) Epoch 6, batch 22000, loss[loss=0.1638, simple_loss=0.2383, pruned_loss=0.04461, over 21485.00 frames. ], tot_loss[loss=0.2049, simple_loss=0.2763, pruned_loss=0.06676, over 4261361.51 frames. ], batch size: 230, lr: 5.00e-03, grad_scale: 32.0 2023-06-24 07:02:19,637 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1046838.0, ans=0.125 2023-06-24 07:02:53,942 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1046958.0, ans=0.2 2023-06-24 07:03:33,424 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.46 vs. 
limit=22.5 2023-06-24 07:03:36,446 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1047078.0, ans=0.0 2023-06-24 07:03:50,990 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1047078.0, ans=0.125 2023-06-24 07:04:00,834 INFO [train.py:996] (0/4) Epoch 6, batch 22050, loss[loss=0.3433, simple_loss=0.4003, pruned_loss=0.1432, over 21407.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2818, pruned_loss=0.06912, over 4257239.82 frames. ], batch size: 507, lr: 5.00e-03, grad_scale: 32.0 2023-06-24 07:04:07,360 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.14 vs. limit=10.0 2023-06-24 07:04:39,713 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.688e+02 2.375e+02 2.787e+02 3.407e+02 5.897e+02, threshold=5.574e+02, percent-clipped=1.0 2023-06-24 07:04:49,400 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.05 vs. limit=12.0 2023-06-24 07:04:51,205 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.17 vs. limit=6.0 2023-06-24 07:05:30,902 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1047378.0, ans=0.2 2023-06-24 07:05:39,593 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1047378.0, ans=0.1 2023-06-24 07:05:39,629 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1047378.0, ans=0.09899494936611666 2023-06-24 07:05:49,357 INFO [train.py:996] (0/4) Epoch 6, batch 22100, loss[loss=0.2403, simple_loss=0.3176, pruned_loss=0.08144, over 21764.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2922, pruned_loss=0.07306, over 4249752.29 frames. ], batch size: 247, lr: 5.00e-03, grad_scale: 16.0 2023-06-24 07:06:26,189 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1047498.0, ans=0.0 2023-06-24 07:07:32,048 INFO [train.py:996] (0/4) Epoch 6, batch 22150, loss[loss=0.2335, simple_loss=0.3048, pruned_loss=0.08108, over 21862.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.2953, pruned_loss=0.07497, over 4254674.33 frames. ], batch size: 371, lr: 5.00e-03, grad_scale: 16.0 2023-06-24 07:07:32,806 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1047738.0, ans=0.125 2023-06-24 07:08:10,538 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.155e+02 2.700e+02 3.228e+02 3.782e+02 5.741e+02, threshold=6.456e+02, percent-clipped=1.0 2023-06-24 07:08:13,028 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1047858.0, ans=0.2 2023-06-24 07:08:39,705 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.76 vs. 
limit=22.5 2023-06-24 07:08:59,153 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1047918.0, ans=0.1 2023-06-24 07:09:13,431 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.31 vs. limit=15.0 2023-06-24 07:09:21,419 INFO [train.py:996] (0/4) Epoch 6, batch 22200, loss[loss=0.2253, simple_loss=0.2822, pruned_loss=0.08416, over 21225.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.2974, pruned_loss=0.07658, over 4267246.69 frames. ], batch size: 608, lr: 5.00e-03, grad_scale: 16.0 2023-06-24 07:10:27,372 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1048218.0, ans=0.09899494936611666 2023-06-24 07:11:09,189 INFO [train.py:996] (0/4) Epoch 6, batch 22250, loss[loss=0.2333, simple_loss=0.3095, pruned_loss=0.07857, over 21499.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3039, pruned_loss=0.07746, over 4265233.26 frames. ], batch size: 211, lr: 5.00e-03, grad_scale: 16.0 2023-06-24 07:11:16,451 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1048338.0, ans=0.125 2023-06-24 07:11:35,543 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.02 vs. limit=12.0 2023-06-24 07:11:46,675 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.112e+02 2.521e+02 2.836e+02 3.368e+02 6.817e+02, threshold=5.671e+02, percent-clipped=1.0 2023-06-24 07:12:10,104 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1048458.0, ans=0.125 2023-06-24 07:12:13,127 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1048458.0, ans=0.125 2023-06-24 07:12:39,019 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=1048578.0, ans=10.0 2023-06-24 07:12:55,406 INFO [train.py:996] (0/4) Epoch 6, batch 22300, loss[loss=0.234, simple_loss=0.3042, pruned_loss=0.08189, over 21830.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3067, pruned_loss=0.07979, over 4271522.60 frames. ], batch size: 414, lr: 4.99e-03, grad_scale: 16.0 2023-06-24 07:13:34,427 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.47 vs. limit=15.0 2023-06-24 07:14:11,621 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.whiten.whitening_limit, batch_count=1048818.0, ans=12.0 2023-06-24 07:14:16,115 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1048818.0, ans=0.0 2023-06-24 07:14:23,178 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1048878.0, ans=0.125 2023-06-24 07:14:30,342 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 07:14:38,225 INFO [train.py:996] (0/4) Epoch 6, batch 22350, loss[loss=0.2335, simple_loss=0.3, pruned_loss=0.08343, over 21847.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3049, pruned_loss=0.08104, over 4282968.66 frames. 
], batch size: 371, lr: 4.99e-03, grad_scale: 16.0 2023-06-24 07:14:48,443 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.00 vs. limit=15.0 2023-06-24 07:14:51,239 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1048938.0, ans=0.0 2023-06-24 07:15:15,679 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.270e+02 2.647e+02 2.993e+02 3.483e+02 5.422e+02, threshold=5.987e+02, percent-clipped=0.0 2023-06-24 07:15:23,409 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1049058.0, ans=10.0 2023-06-24 07:15:33,772 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1049058.0, ans=0.2 2023-06-24 07:15:48,113 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1049118.0, ans=0.125 2023-06-24 07:16:10,241 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1049178.0, ans=0.0 2023-06-24 07:16:20,245 INFO [train.py:996] (0/4) Epoch 6, batch 22400, loss[loss=0.1836, simple_loss=0.2776, pruned_loss=0.04482, over 20961.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3011, pruned_loss=0.07786, over 4273393.24 frames. ], batch size: 608, lr: 4.99e-03, grad_scale: 32.0 2023-06-24 07:16:20,754 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=1049238.0, ans=0.05 2023-06-24 07:16:27,710 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1049238.0, ans=0.125 2023-06-24 07:16:57,890 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1049298.0, ans=0.2 2023-06-24 07:17:13,075 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1049358.0, ans=0.125 2023-06-24 07:17:26,132 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1049358.0, ans=0.1 2023-06-24 07:17:36,764 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.64 vs. limit=15.0 2023-06-24 07:18:07,116 INFO [train.py:996] (0/4) Epoch 6, batch 22450, loss[loss=0.1667, simple_loss=0.2334, pruned_loss=0.05003, over 21585.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.2957, pruned_loss=0.07647, over 4268189.93 frames. 
], batch size: 247, lr: 4.99e-03, grad_scale: 16.0 2023-06-24 07:18:07,578 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1049538.0, ans=0.1 2023-06-24 07:18:33,464 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1049598.0, ans=0.0 2023-06-24 07:18:44,611 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1049598.0, ans=0.0 2023-06-24 07:18:52,749 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.816e+02 2.523e+02 2.860e+02 3.590e+02 5.659e+02, threshold=5.720e+02, percent-clipped=0.0 2023-06-24 07:18:58,957 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1049658.0, ans=0.0 2023-06-24 07:19:10,078 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1049658.0, ans=0.0 2023-06-24 07:19:34,288 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1049778.0, ans=0.125 2023-06-24 07:19:35,913 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1049778.0, ans=0.125 2023-06-24 07:19:36,535 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.82 vs. limit=10.0 2023-06-24 07:19:39,939 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.08 vs. limit=15.0 2023-06-24 07:19:50,752 INFO [train.py:996] (0/4) Epoch 6, batch 22500, loss[loss=0.2104, simple_loss=0.2712, pruned_loss=0.07482, over 21554.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2917, pruned_loss=0.07622, over 4268938.43 frames. ], batch size: 230, lr: 4.99e-03, grad_scale: 16.0 2023-06-24 07:20:38,017 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1049898.0, ans=0.125 2023-06-24 07:20:58,087 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1049958.0, ans=0.125 2023-06-24 07:21:03,269 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1050018.0, ans=0.1 2023-06-24 07:21:28,265 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1050078.0, ans=0.2 2023-06-24 07:21:30,014 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1050078.0, ans=0.0 2023-06-24 07:21:33,746 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1050078.0, ans=0.0 2023-06-24 07:21:40,284 INFO [train.py:996] (0/4) Epoch 6, batch 22550, loss[loss=0.1755, simple_loss=0.2579, pruned_loss=0.04656, over 16526.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.296, pruned_loss=0.07604, over 4273445.54 frames. 
], batch size: 60, lr: 4.99e-03, grad_scale: 16.0 2023-06-24 07:22:32,442 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.214e+02 2.690e+02 3.328e+02 4.292e+02 7.428e+02, threshold=6.656e+02, percent-clipped=5.0 2023-06-24 07:23:03,859 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1050318.0, ans=0.0 2023-06-24 07:23:12,524 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1050378.0, ans=0.125 2023-06-24 07:23:25,729 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1050378.0, ans=0.0 2023-06-24 07:23:28,012 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.87 vs. limit=15.0 2023-06-24 07:23:30,366 INFO [train.py:996] (0/4) Epoch 6, batch 22600, loss[loss=0.243, simple_loss=0.3223, pruned_loss=0.0819, over 21886.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.2989, pruned_loss=0.0764, over 4283671.97 frames. ], batch size: 372, lr: 4.99e-03, grad_scale: 16.0 2023-06-24 07:23:59,097 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.27 vs. limit=15.0 2023-06-24 07:24:55,094 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1050678.0, ans=0.1 2023-06-24 07:25:23,784 INFO [train.py:996] (0/4) Epoch 6, batch 22650, loss[loss=0.2243, simple_loss=0.301, pruned_loss=0.0738, over 15203.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.2956, pruned_loss=0.07591, over 4265750.95 frames. ], batch size: 60, lr: 4.99e-03, grad_scale: 16.0 2023-06-24 07:26:07,809 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.240e+02 2.713e+02 2.934e+02 3.379e+02 4.768e+02, threshold=5.868e+02, percent-clipped=0.0 2023-06-24 07:26:34,162 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1050918.0, ans=0.125 2023-06-24 07:26:39,803 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.24 vs. limit=15.0 2023-06-24 07:27:04,018 INFO [train.py:996] (0/4) Epoch 6, batch 22700, loss[loss=0.1956, simple_loss=0.2636, pruned_loss=0.06376, over 21808.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2889, pruned_loss=0.07463, over 4255369.67 frames. ], batch size: 102, lr: 4.99e-03, grad_scale: 16.0 2023-06-24 07:27:13,521 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1051038.0, ans=0.0 2023-06-24 07:28:18,286 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1051218.0, ans=0.025 2023-06-24 07:28:36,657 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1051278.0, ans=0.125 2023-06-24 07:28:55,485 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1051338.0, ans=0.1 2023-06-24 07:28:56,571 INFO [train.py:996] (0/4) Epoch 6, batch 22750, loss[loss=0.2491, simple_loss=0.3171, pruned_loss=0.09058, over 20696.00 frames. 
], tot_loss[loss=0.2214, simple_loss=0.2902, pruned_loss=0.07628, over 4260369.18 frames. ], batch size: 607, lr: 4.99e-03, grad_scale: 16.0 2023-06-24 07:29:14,580 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.34 vs. limit=12.0 2023-06-24 07:29:41,510 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.135e+02 2.658e+02 2.967e+02 3.229e+02 5.067e+02, threshold=5.933e+02, percent-clipped=0.0 2023-06-24 07:29:42,284 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1051458.0, ans=0.0 2023-06-24 07:29:42,358 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1051458.0, ans=0.2 2023-06-24 07:30:03,957 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.67 vs. limit=15.0 2023-06-24 07:30:13,938 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1051518.0, ans=0.0 2023-06-24 07:30:28,228 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.27 vs. limit=10.0 2023-06-24 07:30:29,307 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1051578.0, ans=0.0 2023-06-24 07:30:49,465 INFO [train.py:996] (0/4) Epoch 6, batch 22800, loss[loss=0.2369, simple_loss=0.3017, pruned_loss=0.08604, over 21413.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.2944, pruned_loss=0.07832, over 4267822.40 frames. ], batch size: 159, lr: 4.99e-03, grad_scale: 32.0 2023-06-24 07:32:31,121 INFO [train.py:996] (0/4) Epoch 6, batch 22850, loss[loss=0.1892, simple_loss=0.2539, pruned_loss=0.0623, over 21740.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.29, pruned_loss=0.07753, over 4264420.42 frames. ], batch size: 283, lr: 4.99e-03, grad_scale: 16.0 2023-06-24 07:33:02,137 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1051998.0, ans=0.1 2023-06-24 07:33:14,545 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.102e+02 2.514e+02 2.935e+02 3.337e+02 4.796e+02, threshold=5.870e+02, percent-clipped=0.0 2023-06-24 07:34:22,983 INFO [train.py:996] (0/4) Epoch 6, batch 22900, loss[loss=0.1777, simple_loss=0.2425, pruned_loss=0.0564, over 21835.00 frames. ], tot_loss[loss=0.223, simple_loss=0.2924, pruned_loss=0.07679, over 4247769.10 frames. ], batch size: 107, lr: 4.99e-03, grad_scale: 16.0 2023-06-24 07:34:40,099 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1052298.0, ans=0.125 2023-06-24 07:34:45,532 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1052298.0, ans=0.1 2023-06-24 07:36:03,200 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.14 vs. limit=15.0 2023-06-24 07:36:14,159 INFO [train.py:996] (0/4) Epoch 6, batch 22950, loss[loss=0.2438, simple_loss=0.3561, pruned_loss=0.06574, over 21782.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3062, pruned_loss=0.07598, over 4253178.13 frames. 
], batch size: 316, lr: 4.99e-03, grad_scale: 16.0 2023-06-24 07:36:18,894 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.69 vs. limit=15.0 2023-06-24 07:36:46,721 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1052598.0, ans=0.0 2023-06-24 07:36:48,961 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.79 vs. limit=15.0 2023-06-24 07:36:56,202 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.951e+02 2.392e+02 2.733e+02 3.196e+02 4.909e+02, threshold=5.466e+02, percent-clipped=0.0 2023-06-24 07:37:05,837 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1052658.0, ans=0.2 2023-06-24 07:37:09,464 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1052658.0, ans=0.1 2023-06-24 07:37:12,716 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1052658.0, ans=0.125 2023-06-24 07:37:18,085 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1052718.0, ans=0.125 2023-06-24 07:37:40,895 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.82 vs. limit=12.0 2023-06-24 07:38:02,391 INFO [train.py:996] (0/4) Epoch 6, batch 23000, loss[loss=0.2356, simple_loss=0.2987, pruned_loss=0.08625, over 21246.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3053, pruned_loss=0.07474, over 4261640.93 frames. ], batch size: 143, lr: 4.98e-03, grad_scale: 16.0 2023-06-24 07:38:56,587 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1052958.0, ans=0.0 2023-06-24 07:39:10,451 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1053018.0, ans=0.2 2023-06-24 07:39:15,619 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1053018.0, ans=0.0 2023-06-24 07:39:28,613 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.18 vs. limit=22.5 2023-06-24 07:39:33,870 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1053078.0, ans=0.5 2023-06-24 07:39:40,754 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1053078.0, ans=0.07 2023-06-24 07:39:58,071 INFO [train.py:996] (0/4) Epoch 6, batch 23050, loss[loss=0.277, simple_loss=0.3463, pruned_loss=0.1039, over 21790.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3076, pruned_loss=0.0778, over 4263973.20 frames. 
], batch size: 441, lr: 4.98e-03, grad_scale: 16.0 2023-06-24 07:40:41,053 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.131e+02 2.622e+02 2.848e+02 3.330e+02 6.770e+02, threshold=5.696e+02, percent-clipped=1.0 2023-06-24 07:41:07,326 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.76 vs. limit=22.5 2023-06-24 07:41:30,648 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1053378.0, ans=0.0 2023-06-24 07:41:48,459 INFO [train.py:996] (0/4) Epoch 6, batch 23100, loss[loss=0.2034, simple_loss=0.2715, pruned_loss=0.06764, over 21784.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3031, pruned_loss=0.0781, over 4264883.08 frames. ], batch size: 118, lr: 4.98e-03, grad_scale: 16.0 2023-06-24 07:42:14,052 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.71 vs. limit=6.0 2023-06-24 07:42:15,001 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1053498.0, ans=0.0 2023-06-24 07:42:22,873 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=10.07 vs. limit=15.0 2023-06-24 07:42:30,701 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1053558.0, ans=0.1 2023-06-24 07:42:45,019 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1053558.0, ans=0.125 2023-06-24 07:43:35,983 INFO [train.py:996] (0/4) Epoch 6, batch 23150, loss[loss=0.2439, simple_loss=0.3044, pruned_loss=0.0917, over 21767.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.2968, pruned_loss=0.07702, over 4272660.24 frames. ], batch size: 441, lr: 4.98e-03, grad_scale: 16.0 2023-06-24 07:43:36,565 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 07:43:41,589 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1053738.0, ans=0.2 2023-06-24 07:43:47,417 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.69 vs. limit=15.0 2023-06-24 07:44:16,024 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.16 vs. limit=15.0 2023-06-24 07:44:16,355 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.069e+02 2.510e+02 2.867e+02 3.300e+02 5.681e+02, threshold=5.734e+02, percent-clipped=0.0 2023-06-24 07:44:28,709 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff2.min_abs, batch_count=1053858.0, ans=0.1 2023-06-24 07:45:15,948 INFO [train.py:996] (0/4) Epoch 6, batch 23200, loss[loss=0.2051, simple_loss=0.2755, pruned_loss=0.06739, over 21550.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.2953, pruned_loss=0.0767, over 4276110.77 frames. 
], batch size: 212, lr: 4.98e-03, grad_scale: 32.0 2023-06-24 07:45:39,167 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1054098.0, ans=10.0 2023-06-24 07:46:49,583 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1054278.0, ans=0.125 2023-06-24 07:47:02,637 INFO [train.py:996] (0/4) Epoch 6, batch 23250, loss[loss=0.2255, simple_loss=0.2913, pruned_loss=0.0799, over 21815.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.2948, pruned_loss=0.07734, over 4287504.95 frames. ], batch size: 282, lr: 4.98e-03, grad_scale: 32.0 2023-06-24 07:47:34,077 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1054398.0, ans=0.2 2023-06-24 07:47:42,309 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1054398.0, ans=0.125 2023-06-24 07:47:44,188 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1054398.0, ans=0.125 2023-06-24 07:47:56,067 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.962e+02 2.641e+02 2.998e+02 3.541e+02 5.576e+02, threshold=5.996e+02, percent-clipped=0.0 2023-06-24 07:48:07,271 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1054458.0, ans=0.125 2023-06-24 07:48:50,013 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1054578.0, ans=0.125 2023-06-24 07:48:58,005 INFO [train.py:996] (0/4) Epoch 6, batch 23300, loss[loss=0.2971, simple_loss=0.394, pruned_loss=0.1001, over 21520.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3026, pruned_loss=0.0793, over 4288808.18 frames. ], batch size: 471, lr: 4.98e-03, grad_scale: 16.0 2023-06-24 07:49:09,495 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1054638.0, ans=0.07 2023-06-24 07:49:46,986 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.77 vs. limit=22.5 2023-06-24 07:49:54,298 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.91 vs. limit=22.5 2023-06-24 07:50:03,970 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1054758.0, ans=0.035 2023-06-24 07:50:11,068 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1054818.0, ans=0.1 2023-06-24 07:50:46,439 INFO [train.py:996] (0/4) Epoch 6, batch 23350, loss[loss=0.1567, simple_loss=0.2382, pruned_loss=0.03762, over 21263.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3064, pruned_loss=0.07798, over 4278424.90 frames. 
], batch size: 176, lr: 4.98e-03, grad_scale: 16.0 2023-06-24 07:51:05,473 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1054938.0, ans=0.05 2023-06-24 07:51:41,624 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.696e+02 2.545e+02 3.075e+02 3.480e+02 4.848e+02, threshold=6.150e+02, percent-clipped=0.0 2023-06-24 07:51:49,725 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.66 vs. limit=15.0 2023-06-24 07:52:28,691 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1055178.0, ans=0.125 2023-06-24 07:52:34,863 INFO [train.py:996] (0/4) Epoch 6, batch 23400, loss[loss=0.2087, simple_loss=0.278, pruned_loss=0.06974, over 21276.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.299, pruned_loss=0.07391, over 4279330.68 frames. ], batch size: 159, lr: 4.98e-03, grad_scale: 16.0 2023-06-24 07:52:56,280 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.11 vs. limit=15.0 2023-06-24 07:53:45,354 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1055418.0, ans=0.1 2023-06-24 07:54:33,580 INFO [train.py:996] (0/4) Epoch 6, batch 23450, loss[loss=0.2561, simple_loss=0.3275, pruned_loss=0.09233, over 21288.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.3003, pruned_loss=0.07643, over 4278477.00 frames. ], batch size: 143, lr: 4.98e-03, grad_scale: 16.0 2023-06-24 07:55:16,985 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.29 vs. limit=5.0 2023-06-24 07:55:17,219 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.107e+02 2.530e+02 2.834e+02 3.227e+02 5.088e+02, threshold=5.668e+02, percent-clipped=0.0 2023-06-24 07:55:24,870 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1055658.0, ans=0.1 2023-06-24 07:55:35,235 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1055718.0, ans=0.125 2023-06-24 07:56:20,864 INFO [train.py:996] (0/4) Epoch 6, batch 23500, loss[loss=0.2159, simple_loss=0.2845, pruned_loss=0.07365, over 21833.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3004, pruned_loss=0.07817, over 4285365.87 frames. ], batch size: 247, lr: 4.98e-03, grad_scale: 16.0 2023-06-24 07:56:27,254 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.15 vs. limit=22.5 2023-06-24 07:56:50,340 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.35 vs. 
limit=15.0 2023-06-24 07:57:11,025 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-176000.pt 2023-06-24 07:57:25,130 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1056018.0, ans=0.125 2023-06-24 07:57:33,666 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1056018.0, ans=0.125 2023-06-24 07:57:47,309 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1056078.0, ans=0.1 2023-06-24 07:58:01,167 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff3.min_abs, batch_count=1056078.0, ans=0.2 2023-06-24 07:58:03,413 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.60 vs. limit=15.0 2023-06-24 07:58:09,288 INFO [train.py:996] (0/4) Epoch 6, batch 23550, loss[loss=0.2092, simple_loss=0.2788, pruned_loss=0.0698, over 21866.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2944, pruned_loss=0.07724, over 4271945.21 frames. ], batch size: 107, lr: 4.98e-03, grad_scale: 16.0 2023-06-24 07:58:46,499 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1056198.0, ans=0.0 2023-06-24 07:58:52,558 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.092e+02 2.611e+02 2.905e+02 3.629e+02 5.861e+02, threshold=5.811e+02, percent-clipped=1.0 2023-06-24 07:59:03,744 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1056258.0, ans=0.125 2023-06-24 07:59:08,883 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1056318.0, ans=0.0 2023-06-24 07:59:57,759 INFO [train.py:996] (0/4) Epoch 6, batch 23600, loss[loss=0.3037, simple_loss=0.3726, pruned_loss=0.1174, over 21847.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.2961, pruned_loss=0.07774, over 4261710.43 frames. ], batch size: 124, lr: 4.98e-03, grad_scale: 32.0 2023-06-24 08:00:20,820 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1056498.0, ans=0.125 2023-06-24 08:00:33,394 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1056498.0, ans=0.2 2023-06-24 08:01:52,008 INFO [train.py:996] (0/4) Epoch 6, batch 23650, loss[loss=0.2796, simple_loss=0.3481, pruned_loss=0.1056, over 21381.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.297, pruned_loss=0.07584, over 4264188.62 frames. 
], batch size: 507, lr: 4.98e-03, grad_scale: 16.0 2023-06-24 08:01:59,736 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1056738.0, ans=0.1 2023-06-24 08:02:06,869 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1056738.0, ans=0.125 2023-06-24 08:02:13,694 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1056798.0, ans=0.0 2023-06-24 08:02:37,240 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1056858.0, ans=0.0 2023-06-24 08:02:38,201 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.907e+02 2.630e+02 3.092e+02 3.541e+02 6.593e+02, threshold=6.183e+02, percent-clipped=1.0 2023-06-24 08:02:46,132 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1056858.0, ans=0.125 2023-06-24 08:03:27,821 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1056978.0, ans=0.125 2023-06-24 08:03:40,795 INFO [train.py:996] (0/4) Epoch 6, batch 23700, loss[loss=0.2022, simple_loss=0.275, pruned_loss=0.06464, over 21297.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.2994, pruned_loss=0.07571, over 4265447.49 frames. ], batch size: 176, lr: 4.97e-03, grad_scale: 16.0 2023-06-24 08:03:46,984 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1057038.0, ans=0.125 2023-06-24 08:03:57,774 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1057098.0, ans=0.0 2023-06-24 08:04:28,663 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1057158.0, ans=0.125 2023-06-24 08:04:56,171 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1057218.0, ans=0.1 2023-06-24 08:05:20,586 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1057278.0, ans=0.1 2023-06-24 08:05:31,887 INFO [train.py:996] (0/4) Epoch 6, batch 23750, loss[loss=0.1849, simple_loss=0.2815, pruned_loss=0.04416, over 21751.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3024, pruned_loss=0.07661, over 4267714.57 frames. ], batch size: 332, lr: 4.97e-03, grad_scale: 16.0 2023-06-24 08:05:57,152 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=1057398.0, ans=6.0 2023-06-24 08:06:02,458 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1057398.0, ans=0.0 2023-06-24 08:06:26,820 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.851e+02 2.305e+02 2.862e+02 3.715e+02 6.571e+02, threshold=5.724e+02, percent-clipped=1.0 2023-06-24 08:06:34,692 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1057458.0, ans=0.125 2023-06-24 08:07:00,506 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.65 vs. 
limit=10.0 2023-06-24 08:07:21,358 INFO [train.py:996] (0/4) Epoch 6, batch 23800, loss[loss=0.3225, simple_loss=0.3935, pruned_loss=0.1257, over 21439.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3019, pruned_loss=0.07467, over 4262481.82 frames. ], batch size: 471, lr: 4.97e-03, grad_scale: 16.0 2023-06-24 08:07:39,097 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 08:08:00,238 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1057698.0, ans=0.125 2023-06-24 08:08:12,353 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1057758.0, ans=0.2 2023-06-24 08:08:12,375 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1057758.0, ans=0.95 2023-06-24 08:09:17,524 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.11 vs. limit=15.0 2023-06-24 08:09:18,108 INFO [train.py:996] (0/4) Epoch 6, batch 23850, loss[loss=0.2289, simple_loss=0.305, pruned_loss=0.07636, over 21380.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3094, pruned_loss=0.07681, over 4264471.05 frames. ], batch size: 176, lr: 4.97e-03, grad_scale: 16.0 2023-06-24 08:10:14,727 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.170e+02 2.790e+02 3.206e+02 3.794e+02 6.982e+02, threshold=6.412e+02, percent-clipped=2.0 2023-06-24 08:10:17,783 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.71 vs. limit=12.0 2023-06-24 08:10:32,846 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1058118.0, ans=0.125 2023-06-24 08:10:59,630 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1058178.0, ans=0.2 2023-06-24 08:11:12,039 INFO [train.py:996] (0/4) Epoch 6, batch 23900, loss[loss=0.2264, simple_loss=0.3103, pruned_loss=0.07124, over 21736.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3161, pruned_loss=0.07813, over 4264970.41 frames. ], batch size: 124, lr: 4.97e-03, grad_scale: 16.0 2023-06-24 08:12:28,674 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1058418.0, ans=0.125 2023-06-24 08:12:57,494 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1058478.0, ans=0.125 2023-06-24 08:13:00,269 INFO [train.py:996] (0/4) Epoch 6, batch 23950, loss[loss=0.2084, simple_loss=0.2689, pruned_loss=0.07395, over 21451.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3092, pruned_loss=0.07758, over 4254524.91 frames. 
], batch size: 211, lr: 4.97e-03, grad_scale: 16.0 2023-06-24 08:13:17,330 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1058538.0, ans=0.1 2023-06-24 08:13:49,931 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1058658.0, ans=0.1 2023-06-24 08:13:52,876 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.147e+02 2.675e+02 3.021e+02 3.458e+02 5.557e+02, threshold=6.041e+02, percent-clipped=0.0 2023-06-24 08:13:57,905 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.07 vs. limit=10.0 2023-06-24 08:14:55,870 INFO [train.py:996] (0/4) Epoch 6, batch 24000, loss[loss=0.2889, simple_loss=0.3551, pruned_loss=0.1113, over 21596.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3105, pruned_loss=0.08058, over 4245583.48 frames. ], batch size: 415, lr: 4.97e-03, grad_scale: 32.0 2023-06-24 08:14:55,872 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-24 08:15:17,149 INFO [train.py:1028] (0/4) Epoch 6, validation: loss=0.2634, simple_loss=0.3603, pruned_loss=0.08319, over 1796401.00 frames. 2023-06-24 08:15:17,150 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23616MB 2023-06-24 08:16:34,353 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1059018.0, ans=0.1 2023-06-24 08:16:48,338 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1059078.0, ans=0.125 2023-06-24 08:16:48,383 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1059078.0, ans=0.125 2023-06-24 08:17:08,168 INFO [train.py:996] (0/4) Epoch 6, batch 24050, loss[loss=0.1874, simple_loss=0.2765, pruned_loss=0.04915, over 21402.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3127, pruned_loss=0.08145, over 4257681.45 frames. ], batch size: 194, lr: 4.97e-03, grad_scale: 16.0 2023-06-24 08:17:24,859 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1059198.0, ans=0.125 2023-06-24 08:17:48,061 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1059258.0, ans=0.125 2023-06-24 08:17:56,749 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.988e+02 2.625e+02 3.028e+02 3.764e+02 6.671e+02, threshold=6.056e+02, percent-clipped=1.0 2023-06-24 08:18:28,061 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.67 vs. limit=15.0 2023-06-24 08:18:43,502 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1059378.0, ans=0.125 2023-06-24 08:18:44,057 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.17 vs. limit=22.5 2023-06-24 08:18:56,460 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1059378.0, ans=0.0 2023-06-24 08:18:59,229 INFO [train.py:996] (0/4) Epoch 6, batch 24100, loss[loss=0.3005, simple_loss=0.3639, pruned_loss=0.1185, over 21446.00 frames. 
], tot_loss[loss=0.2359, simple_loss=0.3126, pruned_loss=0.07957, over 4260238.22 frames. ], batch size: 471, lr: 4.97e-03, grad_scale: 16.0 2023-06-24 08:20:15,520 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1059618.0, ans=0.05 2023-06-24 08:20:49,102 INFO [train.py:996] (0/4) Epoch 6, batch 24150, loss[loss=0.2697, simple_loss=0.3219, pruned_loss=0.1088, over 21599.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3121, pruned_loss=0.08096, over 4266451.66 frames. ], batch size: 471, lr: 4.97e-03, grad_scale: 16.0 2023-06-24 08:21:05,874 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1059798.0, ans=0.0 2023-06-24 08:21:16,747 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1059798.0, ans=0.1 2023-06-24 08:21:43,499 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.053e+02 2.678e+02 3.013e+02 3.443e+02 5.621e+02, threshold=6.026e+02, percent-clipped=0.0 2023-06-24 08:22:00,116 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=2.560e-03 2023-06-24 08:22:05,291 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1059918.0, ans=0.125 2023-06-24 08:22:37,712 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1059978.0, ans=0.125 2023-06-24 08:22:40,794 INFO [train.py:996] (0/4) Epoch 6, batch 24200, loss[loss=0.2342, simple_loss=0.3094, pruned_loss=0.07953, over 21481.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.314, pruned_loss=0.08182, over 4275245.43 frames. ], batch size: 212, lr: 4.97e-03, grad_scale: 16.0 2023-06-24 08:22:52,534 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1060038.0, ans=0.0 2023-06-24 08:22:53,097 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.91 vs. limit=12.0 2023-06-24 08:23:10,941 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1060098.0, ans=0.125 2023-06-24 08:24:10,099 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1060278.0, ans=0.1 2023-06-24 08:24:13,968 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1060278.0, ans=0.125 2023-06-24 08:24:27,613 INFO [train.py:996] (0/4) Epoch 6, batch 24250, loss[loss=0.2405, simple_loss=0.3303, pruned_loss=0.07538, over 21452.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3114, pruned_loss=0.07543, over 4276640.73 frames. ], batch size: 507, lr: 4.97e-03, grad_scale: 16.0 2023-06-24 08:24:33,286 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1060338.0, ans=0.1 2023-06-24 08:24:53,191 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1060338.0, ans=0.0 2023-06-24 08:25:12,972 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.76 vs. 
limit=22.5 2023-06-24 08:25:25,862 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.522e+02 2.253e+02 2.770e+02 3.370e+02 5.813e+02, threshold=5.539e+02, percent-clipped=0.0 2023-06-24 08:25:54,908 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.46 vs. limit=6.0 2023-06-24 08:26:15,753 INFO [train.py:996] (0/4) Epoch 6, batch 24300, loss[loss=0.1557, simple_loss=0.239, pruned_loss=0.0362, over 21757.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.3035, pruned_loss=0.06998, over 4277227.50 frames. ], batch size: 282, lr: 4.97e-03, grad_scale: 16.0 2023-06-24 08:26:53,579 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.14 vs. limit=15.0 2023-06-24 08:26:54,590 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 08:26:54,665 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1060758.0, ans=0.0 2023-06-24 08:27:53,129 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.73 vs. limit=15.0 2023-06-24 08:28:09,177 INFO [train.py:996] (0/4) Epoch 6, batch 24350, loss[loss=0.2075, simple_loss=0.2642, pruned_loss=0.07539, over 20234.00 frames. ], tot_loss[loss=0.221, simple_loss=0.3002, pruned_loss=0.07094, over 4282170.65 frames. ], batch size: 702, lr: 4.97e-03, grad_scale: 16.0 2023-06-24 08:28:11,914 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 08:28:39,922 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.69 vs. limit=15.0 2023-06-24 08:29:01,796 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.594e+02 2.610e+02 2.946e+02 3.475e+02 5.631e+02, threshold=5.892e+02, percent-clipped=1.0 2023-06-24 08:29:29,252 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1061178.0, ans=0.125 2023-06-24 08:29:58,785 INFO [train.py:996] (0/4) Epoch 6, batch 24400, loss[loss=0.222, simple_loss=0.295, pruned_loss=0.0745, over 21694.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.3042, pruned_loss=0.07355, over 4278653.72 frames. ], batch size: 351, lr: 4.97e-03, grad_scale: 32.0 2023-06-24 08:30:52,939 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1061358.0, ans=0.125 2023-06-24 08:31:16,894 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.64 vs. limit=15.0 2023-06-24 08:31:37,635 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1061478.0, ans=0.125 2023-06-24 08:31:41,885 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.49 vs. 
limit=6.0 2023-06-24 08:31:42,815 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1061478.0, ans=0.125 2023-06-24 08:31:49,136 INFO [train.py:996] (0/4) Epoch 6, batch 24450, loss[loss=0.2848, simple_loss=0.3771, pruned_loss=0.09624, over 21678.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3076, pruned_loss=0.0757, over 4271097.40 frames. ], batch size: 414, lr: 4.96e-03, grad_scale: 32.0 2023-06-24 08:31:53,835 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1061538.0, ans=0.125 2023-06-24 08:31:56,894 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1061538.0, ans=0.2 2023-06-24 08:32:28,202 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1061598.0, ans=0.125 2023-06-24 08:32:41,892 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.056e+02 2.780e+02 3.190e+02 3.668e+02 5.575e+02, threshold=6.380e+02, percent-clipped=0.0 2023-06-24 08:32:44,745 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.66 vs. limit=15.0 2023-06-24 08:32:51,935 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.53 vs. limit=15.0 2023-06-24 08:33:03,964 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.27 vs. limit=12.0 2023-06-24 08:33:15,713 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1061718.0, ans=0.125 2023-06-24 08:33:17,185 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1061718.0, ans=0.0 2023-06-24 08:33:17,273 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1061718.0, ans=0.04949747468305833 2023-06-24 08:33:22,868 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.17 vs. limit=15.0 2023-06-24 08:33:28,018 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.82 vs. limit=15.0 2023-06-24 08:33:37,486 INFO [train.py:996] (0/4) Epoch 6, batch 24500, loss[loss=0.234, simple_loss=0.3096, pruned_loss=0.07921, over 21883.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3088, pruned_loss=0.07605, over 4277238.55 frames. ], batch size: 107, lr: 4.96e-03, grad_scale: 32.0 2023-06-24 08:34:17,481 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1061898.0, ans=0.0 2023-06-24 08:34:37,656 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.39 vs. limit=15.0 2023-06-24 08:34:40,565 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1062018.0, ans=0.125 2023-06-24 08:35:28,559 INFO [train.py:996] (0/4) Epoch 6, batch 24550, loss[loss=0.2788, simple_loss=0.3559, pruned_loss=0.1008, over 21552.00 frames. 
], tot_loss[loss=0.2338, simple_loss=0.3106, pruned_loss=0.07854, over 4274247.98 frames. ], batch size: 414, lr: 4.96e-03, grad_scale: 16.0 2023-06-24 08:35:49,424 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.11 vs. limit=15.0 2023-06-24 08:35:53,331 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.75 vs. limit=15.0 2023-06-24 08:35:54,552 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1062198.0, ans=0.125 2023-06-24 08:36:18,658 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.031e+02 2.580e+02 2.942e+02 3.468e+02 6.882e+02, threshold=5.884e+02, percent-clipped=1.0 2023-06-24 08:36:26,808 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1062258.0, ans=0.04949747468305833 2023-06-24 08:36:26,919 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1062258.0, ans=0.0 2023-06-24 08:36:49,371 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1062318.0, ans=0.2 2023-06-24 08:37:01,288 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1062378.0, ans=0.125 2023-06-24 08:37:18,641 INFO [train.py:996] (0/4) Epoch 6, batch 24600, loss[loss=0.2041, simple_loss=0.2616, pruned_loss=0.07325, over 21229.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3058, pruned_loss=0.07829, over 4278544.26 frames. ], batch size: 143, lr: 4.96e-03, grad_scale: 16.0 2023-06-24 08:37:54,661 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 08:39:14,750 INFO [train.py:996] (0/4) Epoch 6, batch 24650, loss[loss=0.1959, simple_loss=0.2602, pruned_loss=0.06582, over 21554.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.2977, pruned_loss=0.07709, over 4265872.03 frames. ], batch size: 391, lr: 4.96e-03, grad_scale: 16.0 2023-06-24 08:39:15,456 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1062738.0, ans=0.025 2023-06-24 08:39:20,376 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1062738.0, ans=0.125 2023-06-24 08:39:43,217 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 08:40:02,612 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.42 vs. limit=15.0 2023-06-24 08:40:03,028 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.821e+02 2.707e+02 3.176e+02 3.617e+02 5.573e+02, threshold=6.353e+02, percent-clipped=0.0 2023-06-24 08:40:11,388 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1062858.0, ans=0.125 2023-06-24 08:40:47,505 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=22.07 vs. 
limit=22.5 2023-06-24 08:40:54,279 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.52 vs. limit=12.0 2023-06-24 08:41:03,588 INFO [train.py:996] (0/4) Epoch 6, batch 24700, loss[loss=0.1931, simple_loss=0.2618, pruned_loss=0.06216, over 21824.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.2956, pruned_loss=0.07566, over 4258127.76 frames. ], batch size: 98, lr: 4.96e-03, grad_scale: 16.0 2023-06-24 08:42:19,483 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1063218.0, ans=0.125 2023-06-24 08:42:30,166 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1063278.0, ans=0.0 2023-06-24 08:42:52,524 INFO [train.py:996] (0/4) Epoch 6, batch 24750, loss[loss=0.1784, simple_loss=0.2459, pruned_loss=0.05549, over 21643.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2901, pruned_loss=0.07371, over 4266930.90 frames. ], batch size: 282, lr: 4.96e-03, grad_scale: 16.0 2023-06-24 08:42:56,651 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1063338.0, ans=0.0 2023-06-24 08:43:26,631 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1063458.0, ans=0.09899494936611666 2023-06-24 08:43:33,383 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1063458.0, ans=0.2 2023-06-24 08:43:41,627 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.995e+02 2.438e+02 2.880e+02 3.643e+02 9.109e+02, threshold=5.760e+02, percent-clipped=1.0 2023-06-24 08:44:33,182 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1063578.0, ans=0.0 2023-06-24 08:44:36,351 INFO [train.py:996] (0/4) Epoch 6, batch 24800, loss[loss=0.222, simple_loss=0.2877, pruned_loss=0.07811, over 21623.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2846, pruned_loss=0.07288, over 4274688.57 frames. ], batch size: 263, lr: 4.96e-03, grad_scale: 32.0 2023-06-24 08:44:48,294 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1063638.0, ans=0.125 2023-06-24 08:44:51,782 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1063638.0, ans=0.0 2023-06-24 08:45:33,722 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1063758.0, ans=0.025 2023-06-24 08:46:26,850 INFO [train.py:996] (0/4) Epoch 6, batch 24850, loss[loss=0.1972, simple_loss=0.2665, pruned_loss=0.06393, over 21750.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2846, pruned_loss=0.07374, over 4281785.45 frames. ], batch size: 247, lr: 4.96e-03, grad_scale: 32.0 2023-06-24 08:47:02,882 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1063998.0, ans=0.035 2023-06-24 08:47:07,175 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.41 vs. 
limit=15.0 2023-06-24 08:47:21,020 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.206e+02 2.836e+02 3.370e+02 3.940e+02 7.201e+02, threshold=6.739e+02, percent-clipped=1.0 2023-06-24 08:47:36,700 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.42 vs. limit=15.0 2023-06-24 08:48:20,715 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1064238.0, ans=0.125 2023-06-24 08:48:21,643 INFO [train.py:996] (0/4) Epoch 6, batch 24900, loss[loss=0.2987, simple_loss=0.363, pruned_loss=0.1171, over 21429.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2873, pruned_loss=0.07485, over 4272037.31 frames. ], batch size: 471, lr: 4.96e-03, grad_scale: 32.0 2023-06-24 08:48:47,462 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1064298.0, ans=0.0 2023-06-24 08:50:14,171 INFO [train.py:996] (0/4) Epoch 6, batch 24950, loss[loss=0.2321, simple_loss=0.2995, pruned_loss=0.08238, over 20678.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.296, pruned_loss=0.07926, over 4271068.94 frames. ], batch size: 607, lr: 4.96e-03, grad_scale: 16.0 2023-06-24 08:50:58,637 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1064598.0, ans=0.0 2023-06-24 08:50:58,652 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1064598.0, ans=0.125 2023-06-24 08:51:12,089 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.415e+02 2.867e+02 3.295e+02 3.992e+02 6.156e+02, threshold=6.590e+02, percent-clipped=0.0 2023-06-24 08:51:47,834 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1064778.0, ans=0.0 2023-06-24 08:52:06,483 INFO [train.py:996] (0/4) Epoch 6, batch 25000, loss[loss=0.2443, simple_loss=0.3173, pruned_loss=0.08569, over 20723.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3035, pruned_loss=0.081, over 4274373.67 frames. ], batch size: 607, lr: 4.96e-03, grad_scale: 16.0 2023-06-24 08:52:12,452 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1064838.0, ans=0.0 2023-06-24 08:52:28,917 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.08 vs. limit=15.0 2023-06-24 08:52:52,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1064958.0, ans=0.125 2023-06-24 08:53:35,990 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1065078.0, ans=0.07 2023-06-24 08:53:42,842 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1065078.0, ans=0.125 2023-06-24 08:53:54,474 INFO [train.py:996] (0/4) Epoch 6, batch 25050, loss[loss=0.2032, simple_loss=0.2643, pruned_loss=0.07111, over 21749.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.2969, pruned_loss=0.07918, over 4270733.42 frames. 
], batch size: 317, lr: 4.96e-03, grad_scale: 16.0 2023-06-24 08:53:56,986 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1065138.0, ans=0.0 2023-06-24 08:54:24,057 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1065198.0, ans=0.125 2023-06-24 08:54:41,170 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1065258.0, ans=0.0 2023-06-24 08:54:56,058 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.014e+02 2.544e+02 2.890e+02 3.638e+02 5.399e+02, threshold=5.780e+02, percent-clipped=0.0 2023-06-24 08:55:43,750 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.72 vs. limit=12.0 2023-06-24 08:55:44,186 INFO [train.py:996] (0/4) Epoch 6, batch 25100, loss[loss=0.2483, simple_loss=0.3625, pruned_loss=0.06702, over 19738.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.2923, pruned_loss=0.07745, over 4270359.50 frames. ], batch size: 702, lr: 4.96e-03, grad_scale: 16.0 2023-06-24 08:55:56,873 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 08:56:03,645 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1065438.0, ans=0.035 2023-06-24 08:56:09,084 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1065498.0, ans=0.1 2023-06-24 08:56:26,288 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1065498.0, ans=0.1 2023-06-24 08:56:31,799 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1065558.0, ans=0.2 2023-06-24 08:56:38,356 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1065558.0, ans=0.125 2023-06-24 08:57:31,186 INFO [train.py:996] (0/4) Epoch 6, batch 25150, loss[loss=0.2826, simple_loss=0.3457, pruned_loss=0.1097, over 21687.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.2964, pruned_loss=0.07588, over 4257552.40 frames. ], batch size: 508, lr: 4.95e-03, grad_scale: 16.0 2023-06-24 08:57:49,807 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.32 vs. limit=15.0 2023-06-24 08:58:03,256 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.34 vs. limit=10.0 2023-06-24 08:58:27,069 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.838e+02 2.368e+02 2.837e+02 3.510e+02 8.139e+02, threshold=5.674e+02, percent-clipped=4.0 2023-06-24 08:58:47,869 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1065918.0, ans=0.125 2023-06-24 08:58:51,092 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1065918.0, ans=0.0 2023-06-24 08:59:20,751 INFO [train.py:996] (0/4) Epoch 6, batch 25200, loss[loss=0.213, simple_loss=0.3129, pruned_loss=0.05653, over 21717.00 frames. 
], tot_loss[loss=0.2208, simple_loss=0.2955, pruned_loss=0.07307, over 4263022.80 frames. ], batch size: 351, lr: 4.95e-03, grad_scale: 32.0 2023-06-24 08:59:36,996 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1066098.0, ans=0.2 2023-06-24 09:01:08,319 INFO [train.py:996] (0/4) Epoch 6, batch 25250, loss[loss=0.1877, simple_loss=0.2586, pruned_loss=0.05842, over 21694.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2922, pruned_loss=0.07183, over 4264526.98 frames. ], batch size: 282, lr: 4.95e-03, grad_scale: 32.0 2023-06-24 09:01:23,259 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1066338.0, ans=0.0 2023-06-24 09:01:25,021 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1066398.0, ans=0.125 2023-06-24 09:01:48,554 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.68 vs. limit=15.0 2023-06-24 09:02:12,211 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.941e+02 2.424e+02 2.718e+02 3.085e+02 4.421e+02, threshold=5.437e+02, percent-clipped=0.0 2023-06-24 09:02:45,887 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.77 vs. limit=12.0 2023-06-24 09:02:54,789 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.27 vs. limit=10.0 2023-06-24 09:02:58,788 INFO [train.py:996] (0/4) Epoch 6, batch 25300, loss[loss=0.2504, simple_loss=0.3273, pruned_loss=0.08674, over 21444.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2898, pruned_loss=0.07123, over 4262249.22 frames. ], batch size: 131, lr: 4.95e-03, grad_scale: 16.0 2023-06-24 09:03:22,075 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1066698.0, ans=0.0 2023-06-24 09:04:48,208 INFO [train.py:996] (0/4) Epoch 6, batch 25350, loss[loss=0.1694, simple_loss=0.2562, pruned_loss=0.04131, over 21363.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2917, pruned_loss=0.07076, over 4262373.33 frames. ], batch size: 194, lr: 4.95e-03, grad_scale: 16.0 2023-06-24 09:04:57,799 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1066938.0, ans=0.07 2023-06-24 09:05:45,201 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.48 vs. limit=22.5 2023-06-24 09:05:50,930 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.891e+02 2.518e+02 2.873e+02 3.506e+02 6.244e+02, threshold=5.746e+02, percent-clipped=2.0 2023-06-24 09:06:22,139 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.69 vs. limit=8.0 2023-06-24 09:06:27,603 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1067178.0, ans=0.125 2023-06-24 09:06:35,395 INFO [train.py:996] (0/4) Epoch 6, batch 25400, loss[loss=0.2062, simple_loss=0.2639, pruned_loss=0.07426, over 21165.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2875, pruned_loss=0.07055, over 4260412.53 frames. 
], batch size: 548, lr: 4.95e-03, grad_scale: 16.0 2023-06-24 09:07:18,551 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.21 vs. limit=22.5 2023-06-24 09:07:32,065 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1067358.0, ans=0.0 2023-06-24 09:08:07,045 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.29 vs. limit=12.0 2023-06-24 09:08:25,235 INFO [train.py:996] (0/4) Epoch 6, batch 25450, loss[loss=0.2012, simple_loss=0.2995, pruned_loss=0.05148, over 21748.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2891, pruned_loss=0.07212, over 4258182.56 frames. ], batch size: 298, lr: 4.95e-03, grad_scale: 16.0 2023-06-24 09:09:07,528 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.56 vs. limit=10.0 2023-06-24 09:09:30,506 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.916e+02 2.400e+02 2.613e+02 3.023e+02 4.754e+02, threshold=5.227e+02, percent-clipped=0.0 2023-06-24 09:09:33,317 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1067658.0, ans=0.125 2023-06-24 09:09:53,552 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1067718.0, ans=0.1 2023-06-24 09:09:59,674 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.06 vs. limit=22.5 2023-06-24 09:10:23,342 INFO [train.py:996] (0/4) Epoch 6, batch 25500, loss[loss=0.1805, simple_loss=0.2655, pruned_loss=0.04775, over 21384.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2892, pruned_loss=0.06923, over 4251047.96 frames. ], batch size: 211, lr: 4.95e-03, grad_scale: 16.0 2023-06-24 09:10:56,818 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1067898.0, ans=0.1 2023-06-24 09:11:23,206 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.27 vs. limit=15.0 2023-06-24 09:11:27,748 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1068018.0, ans=0.0 2023-06-24 09:11:38,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff3.min_abs, batch_count=1068018.0, ans=0.2 2023-06-24 09:11:54,864 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.30 vs. limit=12.0 2023-06-24 09:12:14,527 INFO [train.py:996] (0/4) Epoch 6, batch 25550, loss[loss=0.2104, simple_loss=0.3111, pruned_loss=0.05483, over 21768.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2951, pruned_loss=0.07003, over 4239066.12 frames. 
], batch size: 332, lr: 4.95e-03, grad_scale: 16.0 2023-06-24 09:12:17,133 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1068138.0, ans=0.1 2023-06-24 09:12:17,223 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1068138.0, ans=0.0 2023-06-24 09:13:02,575 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1068258.0, ans=0.125 2023-06-24 09:13:20,094 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.858e+02 2.387e+02 2.706e+02 3.316e+02 5.632e+02, threshold=5.413e+02, percent-clipped=1.0 2023-06-24 09:14:05,896 INFO [train.py:996] (0/4) Epoch 6, batch 25600, loss[loss=0.2273, simple_loss=0.3039, pruned_loss=0.07537, over 21778.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2992, pruned_loss=0.07053, over 4253190.40 frames. ], batch size: 298, lr: 4.95e-03, grad_scale: 32.0 2023-06-24 09:14:39,139 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.30 vs. limit=15.0 2023-06-24 09:14:49,446 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.42 vs. limit=15.0 2023-06-24 09:15:03,543 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1068558.0, ans=0.1 2023-06-24 09:15:23,241 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.38 vs. limit=15.0 2023-06-24 09:15:26,296 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1068618.0, ans=0.125 2023-06-24 09:15:39,783 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1068678.0, ans=0.0 2023-06-24 09:15:41,579 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1068678.0, ans=0.125 2023-06-24 09:15:59,350 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1068738.0, ans=0.125 2023-06-24 09:16:00,280 INFO [train.py:996] (0/4) Epoch 6, batch 25650, loss[loss=0.1956, simple_loss=0.2563, pruned_loss=0.0674, over 21555.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.3016, pruned_loss=0.07397, over 4263858.14 frames. ], batch size: 263, lr: 4.95e-03, grad_scale: 32.0 2023-06-24 09:16:52,126 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 09:16:56,698 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.182e+02 2.676e+02 3.048e+02 3.761e+02 7.606e+02, threshold=6.096e+02, percent-clipped=4.0 2023-06-24 09:17:21,524 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1068918.0, ans=0.05 2023-06-24 09:17:41,430 INFO [train.py:996] (0/4) Epoch 6, batch 25700, loss[loss=0.2278, simple_loss=0.3119, pruned_loss=0.07191, over 21271.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.2982, pruned_loss=0.07516, over 4271345.71 frames. 
], batch size: 143, lr: 4.95e-03, grad_scale: 32.0 2023-06-24 09:17:57,229 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1069038.0, ans=0.125 2023-06-24 09:18:18,432 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1069098.0, ans=0.125 2023-06-24 09:18:53,412 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1069218.0, ans=0.2 2023-06-24 09:19:09,439 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1069218.0, ans=0.1 2023-06-24 09:19:31,836 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1069278.0, ans=0.0 2023-06-24 09:19:33,288 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1069278.0, ans=0.1 2023-06-24 09:19:39,964 INFO [train.py:996] (0/4) Epoch 6, batch 25750, loss[loss=0.3018, simple_loss=0.3669, pruned_loss=0.1184, over 21765.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3041, pruned_loss=0.07825, over 4281810.36 frames. ], batch size: 441, lr: 4.95e-03, grad_scale: 32.0 2023-06-24 09:19:49,185 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 09:20:39,002 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1069458.0, ans=0.1 2023-06-24 09:20:43,890 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.169e+02 2.640e+02 3.088e+02 3.573e+02 6.081e+02, threshold=6.175e+02, percent-clipped=0.0 2023-06-24 09:20:54,040 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.85 vs. limit=22.5 2023-06-24 09:21:02,834 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1069518.0, ans=0.05 2023-06-24 09:21:04,676 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1069518.0, ans=0.09899494936611666 2023-06-24 09:21:39,491 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.47 vs. limit=15.0 2023-06-24 09:21:42,018 INFO [train.py:996] (0/4) Epoch 6, batch 25800, loss[loss=0.2631, simple_loss=0.3567, pruned_loss=0.08473, over 21379.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3147, pruned_loss=0.08237, over 4280745.78 frames. ], batch size: 131, lr: 4.95e-03, grad_scale: 32.0 2023-06-24 09:21:48,009 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1069638.0, ans=0.2 2023-06-24 09:22:24,545 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1069758.0, ans=0.1 2023-06-24 09:22:26,024 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1069758.0, ans=0.125 2023-06-24 09:22:28,535 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.70 vs. 
limit=22.5 2023-06-24 09:22:29,668 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1069758.0, ans=0.125 2023-06-24 09:23:06,161 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1069878.0, ans=0.1 2023-06-24 09:23:30,418 INFO [train.py:996] (0/4) Epoch 6, batch 25850, loss[loss=0.2115, simple_loss=0.2876, pruned_loss=0.06768, over 21679.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3165, pruned_loss=0.08164, over 4284894.06 frames. ], batch size: 230, lr: 4.94e-03, grad_scale: 32.0 2023-06-24 09:23:32,937 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1069938.0, ans=0.125 2023-06-24 09:23:36,707 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1069938.0, ans=0.125 2023-06-24 09:23:40,251 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1069938.0, ans=0.1 2023-06-24 09:23:52,343 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1069998.0, ans=0.125 2023-06-24 09:24:10,644 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.37 vs. limit=22.5 2023-06-24 09:24:29,139 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.006e+02 2.685e+02 2.967e+02 3.484e+02 6.005e+02, threshold=5.935e+02, percent-clipped=0.0 2023-06-24 09:25:04,099 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1070178.0, ans=0.1 2023-06-24 09:25:10,837 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1070178.0, ans=0.0 2023-06-24 09:25:21,138 INFO [train.py:996] (0/4) Epoch 6, batch 25900, loss[loss=0.2915, simple_loss=0.3823, pruned_loss=0.1004, over 21357.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3187, pruned_loss=0.08289, over 4289650.35 frames. ], batch size: 548, lr: 4.94e-03, grad_scale: 32.0 2023-06-24 09:25:59,330 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1070298.0, ans=0.125 2023-06-24 09:27:04,450 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1070478.0, ans=0.125 2023-06-24 09:27:16,107 INFO [train.py:996] (0/4) Epoch 6, batch 25950, loss[loss=0.2534, simple_loss=0.325, pruned_loss=0.0909, over 21254.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.3247, pruned_loss=0.08519, over 4288084.43 frames. 
], batch size: 159, lr: 4.94e-03, grad_scale: 32.0 2023-06-24 09:27:20,539 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1070538.0, ans=0.0 2023-06-24 09:27:35,455 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1070538.0, ans=0.0 2023-06-24 09:28:08,520 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1070658.0, ans=10.0 2023-06-24 09:28:20,618 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.030e+02 2.613e+02 2.969e+02 3.394e+02 6.568e+02, threshold=5.938e+02, percent-clipped=2.0 2023-06-24 09:28:32,096 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1070718.0, ans=0.0 2023-06-24 09:29:06,459 INFO [train.py:996] (0/4) Epoch 6, batch 26000, loss[loss=0.2108, simple_loss=0.306, pruned_loss=0.05779, over 21713.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3233, pruned_loss=0.08235, over 4276888.73 frames. ], batch size: 124, lr: 4.94e-03, grad_scale: 32.0 2023-06-24 09:29:06,973 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1070838.0, ans=0.0 2023-06-24 09:29:08,864 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1070838.0, ans=0.1 2023-06-24 09:29:08,889 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1070838.0, ans=0.2 2023-06-24 09:29:46,427 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.19 vs. limit=15.0 2023-06-24 09:29:50,010 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.72 vs. limit=10.0 2023-06-24 09:30:23,364 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1071018.0, ans=0.0 2023-06-24 09:31:00,839 INFO [train.py:996] (0/4) Epoch 6, batch 26050, loss[loss=0.2304, simple_loss=0.2977, pruned_loss=0.08156, over 21744.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.324, pruned_loss=0.08308, over 4272026.81 frames. 
], batch size: 112, lr: 4.94e-03, grad_scale: 16.0 2023-06-24 09:31:01,392 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1071138.0, ans=0.0 2023-06-24 09:31:16,847 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1071198.0, ans=0.0 2023-06-24 09:31:37,242 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1071198.0, ans=0.0 2023-06-24 09:31:58,909 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.920e+02 2.591e+02 3.026e+02 3.549e+02 5.342e+02, threshold=6.052e+02, percent-clipped=0.0 2023-06-24 09:31:59,467 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1071258.0, ans=0.0 2023-06-24 09:31:59,536 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1071258.0, ans=0.125 2023-06-24 09:32:28,846 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1071378.0, ans=0.125 2023-06-24 09:32:47,966 INFO [train.py:996] (0/4) Epoch 6, batch 26100, loss[loss=0.2217, simple_loss=0.2834, pruned_loss=0.07998, over 21479.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3171, pruned_loss=0.08244, over 4280423.59 frames. ], batch size: 194, lr: 4.94e-03, grad_scale: 16.0 2023-06-24 09:32:57,825 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1071438.0, ans=0.125 2023-06-24 09:33:04,584 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1071498.0, ans=0.125 2023-06-24 09:33:29,197 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1071498.0, ans=0.0 2023-06-24 09:34:15,825 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1071678.0, ans=0.125 2023-06-24 09:34:37,498 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1071738.0, ans=0.2 2023-06-24 09:34:38,494 INFO [train.py:996] (0/4) Epoch 6, batch 26150, loss[loss=0.2431, simple_loss=0.3238, pruned_loss=0.08123, over 21829.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3144, pruned_loss=0.08261, over 4280757.56 frames. ], batch size: 118, lr: 4.94e-03, grad_scale: 16.0 2023-06-24 09:35:24,527 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.83 vs. limit=12.0 2023-06-24 09:35:31,757 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.06 vs. 
limit=22.5 2023-06-24 09:35:39,630 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.029e+02 2.601e+02 2.864e+02 3.408e+02 4.627e+02, threshold=5.727e+02, percent-clipped=0.0 2023-06-24 09:36:18,996 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1071978.0, ans=0.125 2023-06-24 09:36:24,194 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1071978.0, ans=0.2 2023-06-24 09:36:28,868 INFO [train.py:996] (0/4) Epoch 6, batch 26200, loss[loss=0.258, simple_loss=0.3558, pruned_loss=0.08009, over 21647.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3141, pruned_loss=0.08078, over 4278308.36 frames. ], batch size: 414, lr: 4.94e-03, grad_scale: 16.0 2023-06-24 09:36:53,954 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1072038.0, ans=0.125 2023-06-24 09:37:36,464 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1072218.0, ans=0.09899494936611666 2023-06-24 09:37:36,471 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1072218.0, ans=0.1 2023-06-24 09:38:22,433 INFO [train.py:996] (0/4) Epoch 6, batch 26250, loss[loss=0.2182, simple_loss=0.2989, pruned_loss=0.06875, over 16920.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.318, pruned_loss=0.07973, over 4280271.42 frames. ], batch size: 64, lr: 4.94e-03, grad_scale: 16.0 2023-06-24 09:38:51,508 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.45 vs. limit=22.5 2023-06-24 09:39:09,186 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.29 vs. limit=22.5 2023-06-24 09:39:20,864 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.139e+02 2.533e+02 2.809e+02 3.331e+02 4.740e+02, threshold=5.619e+02, percent-clipped=0.0 2023-06-24 09:39:33,113 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.44 vs. limit=15.0 2023-06-24 09:39:45,448 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=16.19 vs. limit=22.5 2023-06-24 09:40:16,017 INFO [train.py:996] (0/4) Epoch 6, batch 26300, loss[loss=0.2096, simple_loss=0.2805, pruned_loss=0.06932, over 21880.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3145, pruned_loss=0.08053, over 4284412.36 frames. ], batch size: 298, lr: 4.94e-03, grad_scale: 16.0 2023-06-24 09:41:33,601 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.75 vs. limit=15.0 2023-06-24 09:41:33,797 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=1072818.0, ans=15.0 2023-06-24 09:42:05,746 INFO [train.py:996] (0/4) Epoch 6, batch 26350, loss[loss=0.2136, simple_loss=0.2794, pruned_loss=0.07385, over 21217.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3136, pruned_loss=0.08172, over 4284325.33 frames. 
], batch size: 608, lr: 4.94e-03, grad_scale: 16.0 2023-06-24 09:42:45,582 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1073058.0, ans=0.0 2023-06-24 09:42:58,168 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 2.852e+02 3.248e+02 3.843e+02 6.054e+02, threshold=6.496e+02, percent-clipped=2.0 2023-06-24 09:43:14,612 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff3.min_abs, batch_count=1073118.0, ans=0.2 2023-06-24 09:43:53,730 INFO [train.py:996] (0/4) Epoch 6, batch 26400, loss[loss=0.22, simple_loss=0.2825, pruned_loss=0.07875, over 21813.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.308, pruned_loss=0.08132, over 4277745.85 frames. ], batch size: 98, lr: 4.94e-03, grad_scale: 32.0 2023-06-24 09:45:12,829 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1073418.0, ans=0.0 2023-06-24 09:45:50,256 INFO [train.py:996] (0/4) Epoch 6, batch 26450, loss[loss=0.2205, simple_loss=0.287, pruned_loss=0.07702, over 21184.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3058, pruned_loss=0.08041, over 4273427.60 frames. ], batch size: 159, lr: 4.94e-03, grad_scale: 32.0 2023-06-24 09:45:56,300 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1073538.0, ans=0.2 2023-06-24 09:46:26,846 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1073598.0, ans=0.125 2023-06-24 09:46:49,138 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1073658.0, ans=0.1 2023-06-24 09:46:50,151 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.007e+02 2.812e+02 3.126e+02 4.062e+02 8.206e+02, threshold=6.252e+02, percent-clipped=4.0 2023-06-24 09:46:50,875 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 09:47:05,614 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.03 vs. limit=22.5 2023-06-24 09:47:39,828 INFO [train.py:996] (0/4) Epoch 6, batch 26500, loss[loss=0.2755, simple_loss=0.3539, pruned_loss=0.09852, over 21634.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3065, pruned_loss=0.07965, over 4270654.36 frames. ], batch size: 441, lr: 4.94e-03, grad_scale: 32.0 2023-06-24 09:48:17,153 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.70 vs. limit=6.0 2023-06-24 09:48:25,427 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1073958.0, ans=0.09899494936611666 2023-06-24 09:48:51,745 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.89 vs. limit=12.0 2023-06-24 09:49:10,368 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1074018.0, ans=0.125 2023-06-24 09:49:31,890 INFO [train.py:996] (0/4) Epoch 6, batch 26550, loss[loss=0.1746, simple_loss=0.2465, pruned_loss=0.0514, over 21286.00 frames. 
], tot_loss[loss=0.23, simple_loss=0.3047, pruned_loss=0.07763, over 4262987.31 frames. ], batch size: 176, lr: 4.94e-03, grad_scale: 32.0 2023-06-24 09:50:25,656 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 09:50:37,967 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1074258.0, ans=0.125 2023-06-24 09:50:41,386 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1074258.0, ans=0.125 2023-06-24 09:50:42,538 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.986e+02 2.610e+02 3.106e+02 3.674e+02 5.828e+02, threshold=6.212e+02, percent-clipped=0.0 2023-06-24 09:51:21,899 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_na.min_abs, batch_count=1074378.0, ans=0.02 2023-06-24 09:51:26,545 INFO [train.py:996] (0/4) Epoch 6, batch 26600, loss[loss=0.2136, simple_loss=0.304, pruned_loss=0.06161, over 21398.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3041, pruned_loss=0.07462, over 4264495.17 frames. ], batch size: 211, lr: 4.93e-03, grad_scale: 32.0 2023-06-24 09:52:04,556 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1074498.0, ans=0.125 2023-06-24 09:53:09,688 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=1074678.0, ans=15.0 2023-06-24 09:53:15,492 INFO [train.py:996] (0/4) Epoch 6, batch 26650, loss[loss=0.1587, simple_loss=0.2266, pruned_loss=0.04539, over 21184.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.298, pruned_loss=0.07283, over 4260489.00 frames. ], batch size: 176, lr: 4.93e-03, grad_scale: 32.0 2023-06-24 09:53:52,754 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.53 vs. limit=15.0 2023-06-24 09:54:18,561 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.725e+02 2.261e+02 2.468e+02 2.751e+02 5.054e+02, threshold=4.936e+02, percent-clipped=0.0 2023-06-24 09:54:27,693 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1074918.0, ans=0.0 2023-06-24 09:54:40,776 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1074978.0, ans=10.0 2023-06-24 09:55:03,202 INFO [train.py:996] (0/4) Epoch 6, batch 26700, loss[loss=0.1766, simple_loss=0.2665, pruned_loss=0.04339, over 20794.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2904, pruned_loss=0.06978, over 4263826.59 frames. ], batch size: 609, lr: 4.93e-03, grad_scale: 32.0 2023-06-24 09:55:24,237 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.79 vs. 
limit=10.0 2023-06-24 09:55:28,994 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1075098.0, ans=0.125 2023-06-24 09:55:58,990 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1075158.0, ans=0.1 2023-06-24 09:56:00,716 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1075158.0, ans=0.1 2023-06-24 09:56:54,978 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1075278.0, ans=0.125 2023-06-24 09:56:59,429 INFO [train.py:996] (0/4) Epoch 6, batch 26750, loss[loss=0.24, simple_loss=0.3122, pruned_loss=0.08387, over 21845.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2919, pruned_loss=0.06973, over 4270734.41 frames. ], batch size: 107, lr: 4.93e-03, grad_scale: 16.0 2023-06-24 09:57:55,232 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.775e+02 2.365e+02 2.700e+02 3.222e+02 4.591e+02, threshold=5.400e+02, percent-clipped=0.0 2023-06-24 09:58:21,528 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1075518.0, ans=0.125 2023-06-24 09:58:43,155 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1075578.0, ans=0.1 2023-06-24 09:58:49,568 INFO [train.py:996] (0/4) Epoch 6, batch 26800, loss[loss=0.2679, simple_loss=0.3324, pruned_loss=0.1017, over 21315.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.3007, pruned_loss=0.07487, over 4272000.18 frames. ], batch size: 176, lr: 4.93e-03, grad_scale: 32.0 2023-06-24 09:59:04,389 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1075638.0, ans=0.0 2023-06-24 09:59:14,770 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1075698.0, ans=0.125 2023-06-24 10:00:18,364 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1075818.0, ans=0.0 2023-06-24 10:00:23,871 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1075878.0, ans=0.09899494936611666 2023-06-24 10:00:43,909 INFO [train.py:996] (0/4) Epoch 6, batch 26850, loss[loss=0.223, simple_loss=0.2982, pruned_loss=0.0739, over 20687.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3026, pruned_loss=0.07772, over 4272982.53 frames. ], batch size: 607, lr: 4.93e-03, grad_scale: 16.0 2023-06-24 10:01:05,562 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.63 vs. limit=15.0 2023-06-24 10:01:12,500 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.00 vs. 
limit=15.0 2023-06-24 10:01:46,041 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.916e+02 2.745e+02 3.127e+02 3.693e+02 5.292e+02, threshold=6.255e+02, percent-clipped=0.0 2023-06-24 10:01:50,020 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 10:01:58,836 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1076118.0, ans=0.0 2023-06-24 10:02:25,586 INFO [train.py:996] (0/4) Epoch 6, batch 26900, loss[loss=0.1822, simple_loss=0.2452, pruned_loss=0.05963, over 21264.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.2943, pruned_loss=0.07629, over 4271883.14 frames. ], batch size: 177, lr: 4.93e-03, grad_scale: 16.0 2023-06-24 10:02:37,583 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1076238.0, ans=0.0 2023-06-24 10:03:07,904 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1076358.0, ans=0.125 2023-06-24 10:03:07,958 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1076358.0, ans=0.07 2023-06-24 10:03:44,315 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1076418.0, ans=0.125 2023-06-24 10:04:14,909 INFO [train.py:996] (0/4) Epoch 6, batch 26950, loss[loss=0.2745, simple_loss=0.3485, pruned_loss=0.1002, over 21578.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.2919, pruned_loss=0.07578, over 4270264.93 frames. ], batch size: 441, lr: 4.93e-03, grad_scale: 8.0 2023-06-24 10:04:39,057 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.39 vs. limit=15.0 2023-06-24 10:04:51,003 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1076598.0, ans=0.125 2023-06-24 10:05:26,451 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.965e+02 2.499e+02 2.950e+02 4.079e+02 6.623e+02, threshold=5.900e+02, percent-clipped=3.0 2023-06-24 10:06:07,687 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1076778.0, ans=0.0 2023-06-24 10:06:10,649 INFO [train.py:996] (0/4) Epoch 6, batch 27000, loss[loss=0.2003, simple_loss=0.2936, pruned_loss=0.0535, over 20811.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2917, pruned_loss=0.07314, over 4265460.72 frames. ], batch size: 608, lr: 4.93e-03, grad_scale: 8.0 2023-06-24 10:06:10,650 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-24 10:06:28,767 INFO [train.py:1028] (0/4) Epoch 6, validation: loss=0.2519, simple_loss=0.3439, pruned_loss=0.0799, over 1796401.00 frames. 2023-06-24 10:06:28,768 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23616MB 2023-06-24 10:07:09,681 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1076898.0, ans=0.1 2023-06-24 10:08:18,376 INFO [train.py:996] (0/4) Epoch 6, batch 27050, loss[loss=0.2283, simple_loss=0.306, pruned_loss=0.07529, over 21868.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.295, pruned_loss=0.07071, over 4272889.19 frames. 
], batch size: 332, lr: 4.93e-03, grad_scale: 8.0 2023-06-24 10:08:20,855 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1077138.0, ans=0.125 2023-06-24 10:08:37,708 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.67 vs. limit=22.5 2023-06-24 10:08:43,821 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1077198.0, ans=0.0 2023-06-24 10:08:45,483 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1077198.0, ans=0.125 2023-06-24 10:09:17,560 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1077258.0, ans=10.0 2023-06-24 10:09:34,310 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.742e+02 2.401e+02 2.781e+02 3.239e+02 4.464e+02, threshold=5.563e+02, percent-clipped=0.0 2023-06-24 10:09:36,943 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1077318.0, ans=0.125 2023-06-24 10:10:08,124 INFO [train.py:996] (0/4) Epoch 6, batch 27100, loss[loss=0.2345, simple_loss=0.3084, pruned_loss=0.08024, over 21837.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2974, pruned_loss=0.07154, over 4282050.13 frames. ], batch size: 107, lr: 4.93e-03, grad_scale: 8.0 2023-06-24 10:11:57,798 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=1077738.0, ans=15.0 2023-06-24 10:11:58,245 INFO [train.py:996] (0/4) Epoch 6, batch 27150, loss[loss=0.2422, simple_loss=0.321, pruned_loss=0.08165, over 21277.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3073, pruned_loss=0.07477, over 4285764.32 frames. ], batch size: 176, lr: 4.93e-03, grad_scale: 8.0 2023-06-24 10:13:11,112 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.78 vs. limit=15.0 2023-06-24 10:13:13,439 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.037e+02 2.603e+02 2.899e+02 3.318e+02 5.343e+02, threshold=5.797e+02, percent-clipped=0.0 2023-06-24 10:13:27,733 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.52 vs. limit=15.0 2023-06-24 10:13:52,984 INFO [train.py:996] (0/4) Epoch 6, batch 27200, loss[loss=0.2422, simple_loss=0.3205, pruned_loss=0.08197, over 21931.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.316, pruned_loss=0.07787, over 4283021.25 frames. ], batch size: 316, lr: 4.93e-03, grad_scale: 16.0 2023-06-24 10:14:42,246 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.00 vs. limit=15.0 2023-06-24 10:15:09,877 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1078218.0, ans=0.125 2023-06-24 10:15:25,439 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.00 vs. limit=10.0 2023-06-24 10:15:48,387 INFO [train.py:996] (0/4) Epoch 6, batch 27250, loss[loss=0.2435, simple_loss=0.3158, pruned_loss=0.08558, over 20688.00 frames. 
], tot_loss[loss=0.2409, simple_loss=0.319, pruned_loss=0.08142, over 4280871.86 frames. ], batch size: 607, lr: 4.93e-03, grad_scale: 16.0 2023-06-24 10:15:52,953 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1078338.0, ans=0.1 2023-06-24 10:15:56,973 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.46 vs. limit=15.0 2023-06-24 10:16:38,820 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.73 vs. limit=15.0 2023-06-24 10:16:56,094 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.396e+02 2.982e+02 3.326e+02 3.737e+02 5.172e+02, threshold=6.652e+02, percent-clipped=0.0 2023-06-24 10:16:58,677 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1078518.0, ans=0.0 2023-06-24 10:17:23,744 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1078578.0, ans=0.1 2023-06-24 10:17:32,566 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1078578.0, ans=0.125 2023-06-24 10:17:45,572 INFO [train.py:996] (0/4) Epoch 6, batch 27300, loss[loss=0.2506, simple_loss=0.3285, pruned_loss=0.08633, over 21264.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3217, pruned_loss=0.08286, over 4276820.34 frames. ], batch size: 159, lr: 4.92e-03, grad_scale: 16.0 2023-06-24 10:17:46,467 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1078638.0, ans=0.125 2023-06-24 10:18:06,599 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.34 vs. limit=15.0 2023-06-24 10:18:53,040 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1078818.0, ans=0.125 2023-06-24 10:19:07,492 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.66 vs. limit=22.5 2023-06-24 10:19:12,309 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.13 vs. limit=22.5 2023-06-24 10:19:33,573 INFO [train.py:996] (0/4) Epoch 6, batch 27350, loss[loss=0.237, simple_loss=0.3306, pruned_loss=0.07166, over 21296.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.325, pruned_loss=0.08327, over 4281497.55 frames. ], batch size: 548, lr: 4.92e-03, grad_scale: 16.0 2023-06-24 10:20:37,096 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.169e+02 2.617e+02 2.947e+02 3.408e+02 6.075e+02, threshold=5.893e+02, percent-clipped=0.0 2023-06-24 10:21:19,467 INFO [train.py:996] (0/4) Epoch 6, batch 27400, loss[loss=0.2302, simple_loss=0.2894, pruned_loss=0.08555, over 21622.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3198, pruned_loss=0.08285, over 4289853.92 frames. ], batch size: 441, lr: 4.92e-03, grad_scale: 16.0 2023-06-24 10:21:26,358 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.85 vs. 
limit=22.5 2023-06-24 10:21:43,683 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.60 vs. limit=10.0 2023-06-24 10:22:18,091 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1079358.0, ans=0.1 2023-06-24 10:22:40,624 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1079418.0, ans=0.125 2023-06-24 10:23:06,264 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1079538.0, ans=0.125 2023-06-24 10:23:07,168 INFO [train.py:996] (0/4) Epoch 6, batch 27450, loss[loss=0.231, simple_loss=0.3117, pruned_loss=0.07515, over 21418.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3133, pruned_loss=0.08095, over 4290205.46 frames. ], batch size: 194, lr: 4.92e-03, grad_scale: 16.0 2023-06-24 10:23:08,326 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.72 vs. limit=15.0 2023-06-24 10:23:53,065 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.95 vs. limit=22.5 2023-06-24 10:24:04,781 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1079718.0, ans=0.125 2023-06-24 10:24:07,321 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.980e+02 2.466e+02 2.775e+02 3.164e+02 4.697e+02, threshold=5.550e+02, percent-clipped=0.0 2023-06-24 10:24:50,408 INFO [train.py:996] (0/4) Epoch 6, batch 27500, loss[loss=0.2194, simple_loss=0.2883, pruned_loss=0.0752, over 21643.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3108, pruned_loss=0.08058, over 4291271.46 frames. ], batch size: 263, lr: 4.92e-03, grad_scale: 16.0 2023-06-24 10:25:06,751 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1079898.0, ans=0.0 2023-06-24 10:25:13,608 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1079898.0, ans=0.0 2023-06-24 10:25:33,749 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.30 vs. limit=15.0 2023-06-24 10:25:38,579 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=1079958.0, ans=10.0 2023-06-24 10:25:41,419 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-180000.pt 2023-06-24 10:25:45,976 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=1079958.0, ans=15.0 2023-06-24 10:26:11,820 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.97 vs. limit=15.0 2023-06-24 10:26:29,662 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1080078.0, ans=0.125 2023-06-24 10:26:34,095 INFO [train.py:996] (0/4) Epoch 6, batch 27550, loss[loss=0.1804, simple_loss=0.2528, pruned_loss=0.05403, over 21640.00 frames. 
], tot_loss[loss=0.23, simple_loss=0.3054, pruned_loss=0.07728, over 4292080.35 frames. ], batch size: 298, lr: 4.92e-03, grad_scale: 16.0 2023-06-24 10:26:45,881 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1080138.0, ans=0.125 2023-06-24 10:26:55,728 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1080198.0, ans=0.125 2023-06-24 10:27:43,821 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.907e+02 2.511e+02 2.711e+02 3.223e+02 7.892e+02, threshold=5.422e+02, percent-clipped=3.0 2023-06-24 10:27:54,793 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1080318.0, ans=0.125 2023-06-24 10:28:11,889 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1080378.0, ans=0.125 2023-06-24 10:28:21,570 INFO [train.py:996] (0/4) Epoch 6, batch 27600, loss[loss=0.2081, simple_loss=0.2731, pruned_loss=0.07155, over 21785.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.2983, pruned_loss=0.07607, over 4290775.06 frames. ], batch size: 112, lr: 4.92e-03, grad_scale: 32.0 2023-06-24 10:29:10,644 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1080558.0, ans=0.125 2023-06-24 10:30:08,101 INFO [train.py:996] (0/4) Epoch 6, batch 27650, loss[loss=0.2266, simple_loss=0.3036, pruned_loss=0.07478, over 21841.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.2933, pruned_loss=0.07556, over 4289051.59 frames. ], batch size: 371, lr: 4.92e-03, grad_scale: 32.0 2023-06-24 10:30:10,287 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1080738.0, ans=0.125 2023-06-24 10:30:27,377 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 10:31:02,579 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.53 vs. limit=22.5 2023-06-24 10:31:07,666 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1080858.0, ans=0.125 2023-06-24 10:31:12,208 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.960e+02 2.435e+02 2.709e+02 3.081e+02 4.195e+02, threshold=5.419e+02, percent-clipped=0.0 2023-06-24 10:31:23,564 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1080918.0, ans=0.0 2023-06-24 10:31:49,738 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1080978.0, ans=0.0 2023-06-24 10:31:56,490 INFO [train.py:996] (0/4) Epoch 6, batch 27700, loss[loss=0.2526, simple_loss=0.3222, pruned_loss=0.09148, over 21534.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2911, pruned_loss=0.07353, over 4281053.49 frames. 
], batch size: 471, lr: 4.92e-03, grad_scale: 32.0 2023-06-24 10:32:29,649 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1081098.0, ans=0.1 2023-06-24 10:33:45,302 INFO [train.py:996] (0/4) Epoch 6, batch 27750, loss[loss=0.21, simple_loss=0.2609, pruned_loss=0.07956, over 20190.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.2951, pruned_loss=0.07413, over 4276064.78 frames. ], batch size: 703, lr: 4.92e-03, grad_scale: 32.0 2023-06-24 10:34:22,146 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1081398.0, ans=0.1 2023-06-24 10:34:45,519 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1081458.0, ans=0.0 2023-06-24 10:34:54,054 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1081518.0, ans=0.1 2023-06-24 10:34:55,017 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.871e+02 2.574e+02 2.914e+02 3.859e+02 6.202e+02, threshold=5.827e+02, percent-clipped=2.0 2023-06-24 10:35:01,086 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1081518.0, ans=0.125 2023-06-24 10:35:32,910 INFO [train.py:996] (0/4) Epoch 6, batch 27800, loss[loss=0.2303, simple_loss=0.2971, pruned_loss=0.08172, over 21872.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.2959, pruned_loss=0.07475, over 4287425.72 frames. ], batch size: 371, lr: 4.92e-03, grad_scale: 32.0 2023-06-24 10:35:49,069 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1081698.0, ans=0.125 2023-06-24 10:36:58,290 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1081818.0, ans=0.125 2023-06-24 10:36:59,749 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1081818.0, ans=0.1 2023-06-24 10:37:06,730 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1081878.0, ans=0.125 2023-06-24 10:37:06,777 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1081878.0, ans=0.125 2023-06-24 10:37:09,994 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1081878.0, ans=0.0 2023-06-24 10:37:19,242 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.29 vs. limit=22.5 2023-06-24 10:37:21,726 INFO [train.py:996] (0/4) Epoch 6, batch 27850, loss[loss=0.2433, simple_loss=0.3309, pruned_loss=0.07789, over 21716.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.2967, pruned_loss=0.0753, over 4288793.25 frames. ], batch size: 389, lr: 4.92e-03, grad_scale: 16.0 2023-06-24 10:37:22,548 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1081938.0, ans=0.0 2023-06-24 10:37:30,106 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.94 vs. 
limit=22.5 2023-06-24 10:37:36,749 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1081938.0, ans=0.0 2023-06-24 10:38:02,165 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1081998.0, ans=0.125 2023-06-24 10:38:16,164 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1082058.0, ans=0.125 2023-06-24 10:38:39,968 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.988e+02 2.601e+02 3.026e+02 3.751e+02 1.054e+03, threshold=6.053e+02, percent-clipped=6.0 2023-06-24 10:39:10,475 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1082238.0, ans=0.2 2023-06-24 10:39:11,466 INFO [train.py:996] (0/4) Epoch 6, batch 27900, loss[loss=0.2163, simple_loss=0.2969, pruned_loss=0.06785, over 21137.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3047, pruned_loss=0.0764, over 4293348.25 frames. ], batch size: 143, lr: 4.92e-03, grad_scale: 16.0 2023-06-24 10:39:37,519 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1082238.0, ans=0.125 2023-06-24 10:39:41,200 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.36 vs. limit=15.0 2023-06-24 10:40:39,453 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1082418.0, ans=0.0 2023-06-24 10:41:13,147 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.13 vs. limit=8.0 2023-06-24 10:41:13,479 INFO [train.py:996] (0/4) Epoch 6, batch 27950, loss[loss=0.2226, simple_loss=0.3094, pruned_loss=0.06788, over 21457.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3045, pruned_loss=0.0736, over 4291337.78 frames. ], batch size: 131, lr: 4.92e-03, grad_scale: 16.0 2023-06-24 10:42:19,433 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.909e+02 2.526e+02 3.218e+02 4.121e+02 6.447e+02, threshold=6.437e+02, percent-clipped=1.0 2023-06-24 10:42:22,031 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1082718.0, ans=0.0 2023-06-24 10:43:01,520 INFO [train.py:996] (0/4) Epoch 6, batch 28000, loss[loss=0.2212, simple_loss=0.3018, pruned_loss=0.07033, over 21777.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.3029, pruned_loss=0.07173, over 4292216.87 frames. ], batch size: 112, lr: 4.92e-03, grad_scale: 32.0 2023-06-24 10:43:32,060 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1082898.0, ans=0.125 2023-06-24 10:43:50,871 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1082958.0, ans=0.035 2023-06-24 10:44:56,494 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1083138.0, ans=0.125 2023-06-24 10:44:57,538 INFO [train.py:996] (0/4) Epoch 6, batch 28050, loss[loss=0.1823, simple_loss=0.2456, pruned_loss=0.05945, over 21270.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.2996, pruned_loss=0.07211, over 4291266.93 frames. 
], batch size: 176, lr: 4.91e-03, grad_scale: 32.0 2023-06-24 10:44:58,122 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1083138.0, ans=0.125 2023-06-24 10:45:15,768 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1083198.0, ans=0.0 2023-06-24 10:45:27,739 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1083198.0, ans=0.0 2023-06-24 10:45:28,310 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.44 vs. limit=22.5 2023-06-24 10:45:34,767 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1083198.0, ans=0.025 2023-06-24 10:45:34,786 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1083198.0, ans=0.0 2023-06-24 10:46:04,464 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.090e+02 2.746e+02 3.083e+02 3.764e+02 7.718e+02, threshold=6.165e+02, percent-clipped=1.0 2023-06-24 10:46:14,157 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1083318.0, ans=0.0 2023-06-24 10:46:23,300 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1083378.0, ans=0.0 2023-06-24 10:46:27,148 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.87 vs. limit=22.5 2023-06-24 10:46:45,936 INFO [train.py:996] (0/4) Epoch 6, batch 28100, loss[loss=0.1849, simple_loss=0.247, pruned_loss=0.0614, over 21500.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2985, pruned_loss=0.07239, over 4291544.34 frames. ], batch size: 230, lr: 4.91e-03, grad_scale: 32.0 2023-06-24 10:47:00,777 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1083438.0, ans=0.125 2023-06-24 10:47:11,541 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=1083498.0, ans=15.0 2023-06-24 10:47:23,075 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.01 vs. limit=15.0 2023-06-24 10:47:30,960 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1083558.0, ans=0.125 2023-06-24 10:47:37,670 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1083558.0, ans=0.0 2023-06-24 10:48:19,222 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1083678.0, ans=0.2 2023-06-24 10:48:34,045 INFO [train.py:996] (0/4) Epoch 6, batch 28150, loss[loss=0.2398, simple_loss=0.2791, pruned_loss=0.1002, over 21491.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2926, pruned_loss=0.07201, over 4275950.38 frames. ], batch size: 511, lr: 4.91e-03, grad_scale: 32.0 2023-06-24 10:48:40,421 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.44 vs. 
limit=15.0 2023-06-24 10:49:10,472 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1083798.0, ans=0.0 2023-06-24 10:49:12,379 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1083858.0, ans=0.95 2023-06-24 10:49:18,537 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.19 vs. limit=10.0 2023-06-24 10:49:20,940 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1083858.0, ans=0.1 2023-06-24 10:49:40,079 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.220e+02 2.816e+02 3.227e+02 4.008e+02 8.112e+02, threshold=6.453e+02, percent-clipped=1.0 2023-06-24 10:50:24,159 INFO [train.py:996] (0/4) Epoch 6, batch 28200, loss[loss=0.2717, simple_loss=0.3337, pruned_loss=0.1048, over 21284.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2916, pruned_loss=0.07312, over 4262414.78 frames. ], batch size: 143, lr: 4.91e-03, grad_scale: 32.0 2023-06-24 10:50:44,063 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1084038.0, ans=0.125 2023-06-24 10:51:16,867 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1084158.0, ans=0.0 2023-06-24 10:52:04,620 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.92 vs. limit=10.0 2023-06-24 10:52:11,996 INFO [train.py:996] (0/4) Epoch 6, batch 28250, loss[loss=0.2516, simple_loss=0.3219, pruned_loss=0.09067, over 21302.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.294, pruned_loss=0.07536, over 4270548.18 frames. ], batch size: 159, lr: 4.91e-03, grad_scale: 32.0 2023-06-24 10:52:31,726 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1084338.0, ans=0.125 2023-06-24 10:52:38,490 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1084398.0, ans=0.04949747468305833 2023-06-24 10:53:06,732 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.03 vs. limit=10.0 2023-06-24 10:53:10,005 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1084458.0, ans=0.0 2023-06-24 10:53:13,311 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1084458.0, ans=0.125 2023-06-24 10:53:21,718 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1084518.0, ans=0.0 2023-06-24 10:53:30,917 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.085e+02 2.671e+02 3.008e+02 3.478e+02 6.433e+02, threshold=6.015e+02, percent-clipped=0.0 2023-06-24 10:53:36,856 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1084518.0, ans=0.1 2023-06-24 10:54:03,561 INFO [train.py:996] (0/4) Epoch 6, batch 28300, loss[loss=0.1873, simple_loss=0.283, pruned_loss=0.04579, over 21697.00 frames. 
], tot_loss[loss=0.2201, simple_loss=0.2923, pruned_loss=0.0739, over 4271961.16 frames. ], batch size: 298, lr: 4.91e-03, grad_scale: 32.0 2023-06-24 10:54:25,291 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1084698.0, ans=0.125 2023-06-24 10:54:35,333 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.41 vs. limit=10.0 2023-06-24 10:54:39,836 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1084698.0, ans=0.125 2023-06-24 10:55:23,266 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1084818.0, ans=0.025 2023-06-24 10:55:41,225 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1084878.0, ans=0.125 2023-06-24 10:55:57,156 INFO [train.py:996] (0/4) Epoch 6, batch 28350, loss[loss=0.186, simple_loss=0.2969, pruned_loss=0.03755, over 20802.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2894, pruned_loss=0.06932, over 4271632.29 frames. ], batch size: 608, lr: 4.91e-03, grad_scale: 32.0 2023-06-24 10:56:21,307 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1084998.0, ans=0.125 2023-06-24 10:57:10,500 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.831e+02 2.270e+02 2.582e+02 2.935e+02 5.064e+02, threshold=5.164e+02, percent-clipped=0.0 2023-06-24 10:57:14,639 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1085118.0, ans=0.125 2023-06-24 10:57:22,493 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.08 vs. limit=15.0 2023-06-24 10:57:46,646 INFO [train.py:996] (0/4) Epoch 6, batch 28400, loss[loss=0.1907, simple_loss=0.2544, pruned_loss=0.06345, over 21552.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2864, pruned_loss=0.06848, over 4255350.77 frames. ], batch size: 263, lr: 4.91e-03, grad_scale: 32.0 2023-06-24 10:58:29,171 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.76 vs. limit=15.0 2023-06-24 10:58:42,722 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.33 vs. limit=15.0 2023-06-24 10:58:54,316 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1085418.0, ans=0.125 2023-06-24 10:59:04,731 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1085418.0, ans=0.035 2023-06-24 10:59:31,427 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1085478.0, ans=0.125 2023-06-24 10:59:36,153 INFO [train.py:996] (0/4) Epoch 6, batch 28450, loss[loss=0.2321, simple_loss=0.3023, pruned_loss=0.08093, over 21870.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2904, pruned_loss=0.07135, over 4256366.88 frames. 
], batch size: 371, lr: 4.91e-03, grad_scale: 32.0 2023-06-24 10:59:36,866 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1085538.0, ans=0.125 2023-06-24 11:00:01,236 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1085598.0, ans=0.125 2023-06-24 11:00:25,480 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1085658.0, ans=0.125 2023-06-24 11:00:39,878 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1085718.0, ans=0.2 2023-06-24 11:00:42,790 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 2.782e+02 3.154e+02 3.614e+02 5.624e+02, threshold=6.308e+02, percent-clipped=2.0 2023-06-24 11:00:48,651 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 11:01:25,048 INFO [train.py:996] (0/4) Epoch 6, batch 28500, loss[loss=0.242, simple_loss=0.3225, pruned_loss=0.08075, over 21254.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2924, pruned_loss=0.07325, over 4260706.78 frames. ], batch size: 143, lr: 4.91e-03, grad_scale: 16.0 2023-06-24 11:02:06,743 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.28 vs. limit=10.0 2023-06-24 11:02:34,919 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1086018.0, ans=0.125 2023-06-24 11:03:12,587 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1086078.0, ans=0.1 2023-06-24 11:03:17,199 INFO [train.py:996] (0/4) Epoch 6, batch 28550, loss[loss=0.3006, simple_loss=0.3979, pruned_loss=0.1016, over 21651.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3019, pruned_loss=0.07675, over 4270167.25 frames. ], batch size: 414, lr: 4.91e-03, grad_scale: 16.0 2023-06-24 11:03:32,798 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1086138.0, ans=0.125 2023-06-24 11:03:57,513 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1086198.0, ans=0.1 2023-06-24 11:03:59,271 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1086198.0, ans=0.0 2023-06-24 11:04:22,632 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.79 vs. limit=10.0 2023-06-24 11:04:37,436 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.267e+02 2.822e+02 3.220e+02 3.775e+02 6.822e+02, threshold=6.440e+02, percent-clipped=1.0 2023-06-24 11:04:55,535 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.41 vs. limit=15.0 2023-06-24 11:05:12,861 INFO [train.py:996] (0/4) Epoch 6, batch 28600, loss[loss=0.2234, simple_loss=0.3032, pruned_loss=0.07176, over 21414.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3084, pruned_loss=0.07949, over 4272608.17 frames. 
], batch size: 131, lr: 4.91e-03, grad_scale: 16.0 2023-06-24 11:06:35,164 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1086618.0, ans=0.125 2023-06-24 11:06:42,051 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1086678.0, ans=0.0 2023-06-24 11:07:06,282 INFO [train.py:996] (0/4) Epoch 6, batch 28650, loss[loss=0.2021, simple_loss=0.2661, pruned_loss=0.0691, over 21526.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3018, pruned_loss=0.07848, over 4271789.89 frames. ], batch size: 263, lr: 4.91e-03, grad_scale: 16.0 2023-06-24 11:07:39,837 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.34 vs. limit=15.0 2023-06-24 11:07:45,492 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1086858.0, ans=0.0 2023-06-24 11:08:03,470 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1086858.0, ans=0.09899494936611666 2023-06-24 11:08:14,718 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.097e+02 2.693e+02 2.997e+02 3.394e+02 5.567e+02, threshold=5.993e+02, percent-clipped=0.0 2023-06-24 11:08:52,042 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1086978.0, ans=0.125 2023-06-24 11:08:54,851 INFO [train.py:996] (0/4) Epoch 6, batch 28700, loss[loss=0.2463, simple_loss=0.3245, pruned_loss=0.08407, over 21607.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3008, pruned_loss=0.07916, over 4265101.70 frames. ], batch size: 389, lr: 4.91e-03, grad_scale: 16.0 2023-06-24 11:09:23,803 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1087098.0, ans=0.125 2023-06-24 11:09:59,205 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1087218.0, ans=0.2 2023-06-24 11:10:08,554 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=12.35 vs. limit=15.0 2023-06-24 11:10:26,064 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1087278.0, ans=0.1 2023-06-24 11:10:36,693 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1087278.0, ans=0.125 2023-06-24 11:10:44,942 INFO [train.py:996] (0/4) Epoch 6, batch 28750, loss[loss=0.2271, simple_loss=0.2909, pruned_loss=0.08166, over 20707.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.302, pruned_loss=0.07969, over 4267584.67 frames. ], batch size: 607, lr: 4.91e-03, grad_scale: 16.0 2023-06-24 11:10:50,657 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1087338.0, ans=0.035 2023-06-24 11:11:29,949 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.59 vs. 
limit=15.0 2023-06-24 11:11:38,018 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1087458.0, ans=0.1 2023-06-24 11:11:53,475 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.016e+02 2.664e+02 3.026e+02 3.382e+02 4.910e+02, threshold=6.051e+02, percent-clipped=0.0 2023-06-24 11:11:55,805 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1087518.0, ans=0.1 2023-06-24 11:12:18,293 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1087578.0, ans=0.125 2023-06-24 11:12:21,726 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1087578.0, ans=0.0 2023-06-24 11:12:25,112 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1087578.0, ans=0.1 2023-06-24 11:12:33,143 INFO [train.py:996] (0/4) Epoch 6, batch 28800, loss[loss=0.2599, simple_loss=0.3262, pruned_loss=0.09677, over 21763.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3053, pruned_loss=0.07996, over 4271007.89 frames. ], batch size: 298, lr: 4.90e-03, grad_scale: 32.0 2023-06-24 11:13:14,196 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1087758.0, ans=0.0 2023-06-24 11:13:31,605 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1087818.0, ans=0.125 2023-06-24 11:13:36,986 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1087818.0, ans=0.1 2023-06-24 11:13:48,841 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.44 vs. limit=12.0 2023-06-24 11:13:55,614 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.94 vs. limit=15.0 2023-06-24 11:14:11,771 INFO [train.py:996] (0/4) Epoch 6, batch 28850, loss[loss=0.2176, simple_loss=0.2879, pruned_loss=0.07367, over 21809.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3068, pruned_loss=0.08161, over 4279953.67 frames. ], batch size: 298, lr: 4.90e-03, grad_scale: 32.0 2023-06-24 11:15:25,931 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.267e+02 2.795e+02 3.097e+02 3.558e+02 6.026e+02, threshold=6.195e+02, percent-clipped=0.0 2023-06-24 11:15:41,008 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1088118.0, ans=0.125 2023-06-24 11:15:44,574 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1088178.0, ans=0.0 2023-06-24 11:15:55,372 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1088178.0, ans=0.1 2023-06-24 11:16:01,680 INFO [train.py:996] (0/4) Epoch 6, batch 28900, loss[loss=0.2309, simple_loss=0.2998, pruned_loss=0.08099, over 21690.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3102, pruned_loss=0.08314, over 4285201.31 frames. 
], batch size: 230, lr: 4.90e-03, grad_scale: 32.0 2023-06-24 11:18:05,893 INFO [train.py:996] (0/4) Epoch 6, batch 28950, loss[loss=0.1799, simple_loss=0.2397, pruned_loss=0.06003, over 21289.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3099, pruned_loss=0.08246, over 4275307.52 frames. ], batch size: 159, lr: 4.90e-03, grad_scale: 32.0 2023-06-24 11:18:52,388 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1088658.0, ans=0.125 2023-06-24 11:19:18,000 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.152e+02 3.066e+02 3.487e+02 4.356e+02 7.485e+02, threshold=6.974e+02, percent-clipped=4.0 2023-06-24 11:19:47,051 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1088778.0, ans=0.125 2023-06-24 11:19:57,039 INFO [train.py:996] (0/4) Epoch 6, batch 29000, loss[loss=0.2669, simple_loss=0.3454, pruned_loss=0.09421, over 21800.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3125, pruned_loss=0.08072, over 4272218.90 frames. ], batch size: 124, lr: 4.90e-03, grad_scale: 16.0 2023-06-24 11:21:24,514 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1089078.0, ans=0.0 2023-06-24 11:21:39,610 INFO [train.py:996] (0/4) Epoch 6, batch 29050, loss[loss=0.2203, simple_loss=0.2864, pruned_loss=0.07711, over 21871.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3119, pruned_loss=0.08074, over 4273033.80 frames. ], batch size: 298, lr: 4.90e-03, grad_scale: 16.0 2023-06-24 11:22:06,799 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.82 vs. limit=15.0 2023-06-24 11:22:33,085 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1089258.0, ans=0.125 2023-06-24 11:22:58,764 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 11:22:59,676 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.070e+02 2.602e+02 2.960e+02 3.468e+02 4.732e+02, threshold=5.920e+02, percent-clipped=0.0 2023-06-24 11:23:01,985 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1089318.0, ans=0.1 2023-06-24 11:23:27,552 INFO [train.py:996] (0/4) Epoch 6, batch 29100, loss[loss=0.1702, simple_loss=0.2404, pruned_loss=0.05003, over 21605.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3032, pruned_loss=0.07855, over 4265763.03 frames. ], batch size: 298, lr: 4.90e-03, grad_scale: 16.0 2023-06-24 11:24:22,031 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1089558.0, ans=0.05 2023-06-24 11:24:27,142 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1089558.0, ans=0.125 2023-06-24 11:24:48,306 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.00 vs. 
limit=22.5 2023-06-24 11:25:01,255 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1089678.0, ans=0.2 2023-06-24 11:25:08,081 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1089678.0, ans=0.0 2023-06-24 11:25:10,507 INFO [train.py:996] (0/4) Epoch 6, batch 29150, loss[loss=0.2252, simple_loss=0.3151, pruned_loss=0.06767, over 21332.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3018, pruned_loss=0.0772, over 4261176.39 frames. ], batch size: 176, lr: 4.90e-03, grad_scale: 16.0 2023-06-24 11:25:51,881 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1089798.0, ans=0.125 2023-06-24 11:25:57,363 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.01 vs. limit=22.5 2023-06-24 11:26:20,324 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.89 vs. limit=15.0 2023-06-24 11:26:30,813 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.964e+02 2.501e+02 2.832e+02 3.252e+02 5.475e+02, threshold=5.663e+02, percent-clipped=0.0 2023-06-24 11:26:39,925 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1089978.0, ans=0.125 2023-06-24 11:26:40,615 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.08 vs. limit=15.0 2023-06-24 11:26:58,359 INFO [train.py:996] (0/4) Epoch 6, batch 29200, loss[loss=0.2011, simple_loss=0.2655, pruned_loss=0.06834, over 21233.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.2983, pruned_loss=0.07681, over 4265529.82 frames. ], batch size: 144, lr: 4.90e-03, grad_scale: 32.0 2023-06-24 11:27:12,533 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1090038.0, ans=0.0 2023-06-24 11:27:21,828 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1090098.0, ans=0.07 2023-06-24 11:27:31,184 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.91 vs. limit=6.0 2023-06-24 11:27:55,463 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1090158.0, ans=0.125 2023-06-24 11:28:09,149 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1090158.0, ans=0.1 2023-06-24 11:28:44,290 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1090278.0, ans=0.125 2023-06-24 11:28:47,246 INFO [train.py:996] (0/4) Epoch 6, batch 29250, loss[loss=0.2309, simple_loss=0.3191, pruned_loss=0.07136, over 21837.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2955, pruned_loss=0.07422, over 4259425.44 frames. 
], batch size: 317, lr: 4.90e-03, grad_scale: 32.0 2023-06-24 11:29:00,395 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1090338.0, ans=0.125 2023-06-24 11:29:02,428 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1090338.0, ans=0.2 2023-06-24 11:29:18,163 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1090398.0, ans=0.125 2023-06-24 11:29:40,974 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1090458.0, ans=0.125 2023-06-24 11:30:08,094 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.954e+02 2.479e+02 2.949e+02 4.059e+02 6.998e+02, threshold=5.898e+02, percent-clipped=9.0 2023-06-24 11:30:29,685 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=1090578.0, ans=0.025 2023-06-24 11:30:41,000 INFO [train.py:996] (0/4) Epoch 6, batch 29300, loss[loss=0.2196, simple_loss=0.3034, pruned_loss=0.0679, over 21692.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.2977, pruned_loss=0.07335, over 4266867.75 frames. ], batch size: 351, lr: 4.90e-03, grad_scale: 32.0 2023-06-24 11:30:42,527 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.71 vs. limit=22.5 2023-06-24 11:32:00,294 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.93 vs. limit=15.0 2023-06-24 11:32:31,132 INFO [train.py:996] (0/4) Epoch 6, batch 29350, loss[loss=0.1961, simple_loss=0.2631, pruned_loss=0.06456, over 21721.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2921, pruned_loss=0.07272, over 4254832.50 frames. ], batch size: 351, lr: 4.90e-03, grad_scale: 32.0 2023-06-24 11:33:16,421 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.93 vs. limit=6.0 2023-06-24 11:33:24,598 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1091058.0, ans=0.0 2023-06-24 11:33:34,947 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1091118.0, ans=0.0 2023-06-24 11:33:43,342 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.001e+02 2.584e+02 3.038e+02 3.610e+02 5.891e+02, threshold=6.076e+02, percent-clipped=0.0 2023-06-24 11:34:20,366 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.10 vs. limit=22.5 2023-06-24 11:34:20,413 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.84 vs. limit=22.5 2023-06-24 11:34:22,816 INFO [train.py:996] (0/4) Epoch 6, batch 29400, loss[loss=0.1901, simple_loss=0.2665, pruned_loss=0.05686, over 21582.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2932, pruned_loss=0.07139, over 4257267.63 frames. 
], batch size: 263, lr: 4.90e-03, grad_scale: 32.0 2023-06-24 11:36:06,008 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1091478.0, ans=0.1 2023-06-24 11:36:09,685 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1091478.0, ans=0.0 2023-06-24 11:36:11,611 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1091538.0, ans=0.2 2023-06-24 11:36:12,538 INFO [train.py:996] (0/4) Epoch 6, batch 29450, loss[loss=0.2451, simple_loss=0.3171, pruned_loss=0.08655, over 20722.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2917, pruned_loss=0.07128, over 4259627.77 frames. ], batch size: 609, lr: 4.90e-03, grad_scale: 32.0 2023-06-24 11:36:43,825 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1091598.0, ans=0.0 2023-06-24 11:37:16,131 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.53 vs. limit=15.0 2023-06-24 11:37:26,463 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.944e+02 2.589e+02 2.908e+02 3.358e+02 5.330e+02, threshold=5.817e+02, percent-clipped=0.0 2023-06-24 11:38:00,154 INFO [train.py:996] (0/4) Epoch 6, batch 29500, loss[loss=0.2411, simple_loss=0.3049, pruned_loss=0.08869, over 21813.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.297, pruned_loss=0.07481, over 4262854.59 frames. ], batch size: 441, lr: 4.90e-03, grad_scale: 32.0 2023-06-24 11:38:02,725 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff2.min_abs, batch_count=1091838.0, ans=0.1 2023-06-24 11:38:07,877 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1091838.0, ans=0.1 2023-06-24 11:38:12,872 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1091838.0, ans=0.2 2023-06-24 11:38:18,569 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.55 vs. limit=15.0 2023-06-24 11:38:39,299 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1091898.0, ans=0.0 2023-06-24 11:38:53,154 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1091958.0, ans=0.95 2023-06-24 11:39:34,691 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1092078.0, ans=0.1 2023-06-24 11:39:39,859 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1092078.0, ans=0.125 2023-06-24 11:39:43,488 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1092078.0, ans=0.2 2023-06-24 11:39:49,846 INFO [train.py:996] (0/4) Epoch 6, batch 29550, loss[loss=0.2171, simple_loss=0.2817, pruned_loss=0.07627, over 21618.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.2965, pruned_loss=0.07645, over 4276694.27 frames. 
], batch size: 212, lr: 4.89e-03, grad_scale: 32.0 2023-06-24 11:40:01,802 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.60 vs. limit=15.0 2023-06-24 11:40:13,141 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1092198.0, ans=0.2 2023-06-24 11:40:22,255 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1092198.0, ans=0.0 2023-06-24 11:40:23,893 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1092198.0, ans=0.125 2023-06-24 11:41:03,033 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.75 vs. limit=22.5 2023-06-24 11:41:05,564 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.190e+02 2.863e+02 3.307e+02 3.931e+02 5.796e+02, threshold=6.614e+02, percent-clipped=0.0 2023-06-24 11:41:25,090 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.82 vs. limit=15.0 2023-06-24 11:41:27,822 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1092378.0, ans=0.0 2023-06-24 11:41:39,662 INFO [train.py:996] (0/4) Epoch 6, batch 29600, loss[loss=0.2054, simple_loss=0.2543, pruned_loss=0.0783, over 20361.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3022, pruned_loss=0.07851, over 4285136.81 frames. ], batch size: 703, lr: 4.89e-03, grad_scale: 32.0 2023-06-24 11:41:52,588 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1092438.0, ans=0.125 2023-06-24 11:42:33,781 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1092558.0, ans=0.0 2023-06-24 11:42:46,000 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1092618.0, ans=0.0 2023-06-24 11:43:26,414 INFO [train.py:996] (0/4) Epoch 6, batch 29650, loss[loss=0.1894, simple_loss=0.2588, pruned_loss=0.06007, over 21188.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.2995, pruned_loss=0.07508, over 4279619.23 frames. ], batch size: 159, lr: 4.89e-03, grad_scale: 32.0 2023-06-24 11:43:36,172 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1092738.0, ans=0.0 2023-06-24 11:43:41,928 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.05 vs. limit=15.0 2023-06-24 11:44:47,014 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.627e+02 2.546e+02 3.028e+02 3.755e+02 5.764e+02, threshold=6.055e+02, percent-clipped=0.0 2023-06-24 11:44:54,618 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1092918.0, ans=0.125 2023-06-24 11:45:14,616 INFO [train.py:996] (0/4) Epoch 6, batch 29700, loss[loss=0.3167, simple_loss=0.4073, pruned_loss=0.1131, over 21530.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3041, pruned_loss=0.07619, over 4280785.82 frames. 
], batch size: 471, lr: 4.89e-03, grad_scale: 16.0 2023-06-24 11:46:32,758 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1093218.0, ans=0.07 2023-06-24 11:46:33,206 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.88 vs. limit=15.0 2023-06-24 11:46:37,660 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1093218.0, ans=0.2 2023-06-24 11:46:57,829 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1093278.0, ans=0.2 2023-06-24 11:47:02,452 INFO [train.py:996] (0/4) Epoch 6, batch 29750, loss[loss=0.3102, simple_loss=0.3784, pruned_loss=0.121, over 21537.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3104, pruned_loss=0.07671, over 4282142.63 frames. ], batch size: 507, lr: 4.89e-03, grad_scale: 16.0 2023-06-24 11:47:11,975 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1093338.0, ans=0.2 2023-06-24 11:48:01,119 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1093458.0, ans=0.2 2023-06-24 11:48:01,148 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1093458.0, ans=0.1 2023-06-24 11:48:23,013 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.842e+02 2.425e+02 2.693e+02 3.074e+02 5.352e+02, threshold=5.385e+02, percent-clipped=0.0 2023-06-24 11:48:35,822 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1093578.0, ans=0.1 2023-06-24 11:48:54,214 INFO [train.py:996] (0/4) Epoch 6, batch 29800, loss[loss=0.2271, simple_loss=0.2922, pruned_loss=0.08099, over 21574.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3101, pruned_loss=0.0768, over 4287536.58 frames. ], batch size: 548, lr: 4.89e-03, grad_scale: 16.0 2023-06-24 11:49:29,588 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1093698.0, ans=0.0 2023-06-24 11:49:35,563 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.92 vs. limit=8.0 2023-06-24 11:49:58,768 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1093758.0, ans=0.125 2023-06-24 11:50:34,861 INFO [train.py:996] (0/4) Epoch 6, batch 29850, loss[loss=0.1966, simple_loss=0.2777, pruned_loss=0.05778, over 21745.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.305, pruned_loss=0.07418, over 4284865.94 frames. 
], batch size: 414, lr: 4.89e-03, grad_scale: 16.0 2023-06-24 11:51:30,093 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1094058.0, ans=0.5 2023-06-24 11:51:55,460 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.130e+02 2.491e+02 2.734e+02 3.399e+02 8.130e+02, threshold=5.469e+02, percent-clipped=4.0 2023-06-24 11:51:57,610 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1094118.0, ans=0.125 2023-06-24 11:51:58,216 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.22 vs. limit=10.0 2023-06-24 11:52:01,329 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1094118.0, ans=0.2 2023-06-24 11:52:22,144 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1094178.0, ans=0.05 2023-06-24 11:52:26,442 INFO [train.py:996] (0/4) Epoch 6, batch 29900, loss[loss=0.244, simple_loss=0.3142, pruned_loss=0.08692, over 21880.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3042, pruned_loss=0.07533, over 4284252.34 frames. ], batch size: 371, lr: 4.89e-03, grad_scale: 16.0 2023-06-24 11:52:39,882 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1094238.0, ans=0.0 2023-06-24 11:52:59,756 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.63 vs. limit=22.5 2023-06-24 11:53:26,681 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1094358.0, ans=0.0 2023-06-24 11:53:28,289 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1094358.0, ans=0.025 2023-06-24 11:53:31,985 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1094358.0, ans=0.2 2023-06-24 11:53:31,990 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1094358.0, ans=0.125 2023-06-24 11:54:18,468 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1094478.0, ans=0.125 2023-06-24 11:54:23,089 INFO [train.py:996] (0/4) Epoch 6, batch 29950, loss[loss=0.238, simple_loss=0.3115, pruned_loss=0.08218, over 21730.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3069, pruned_loss=0.07887, over 4285488.93 frames. 
], batch size: 298, lr: 4.89e-03, grad_scale: 16.0 2023-06-24 11:54:30,860 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1094538.0, ans=0.0 2023-06-24 11:55:21,091 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1094658.0, ans=0.125 2023-06-24 11:55:41,531 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.234e+02 2.840e+02 3.123e+02 3.616e+02 5.024e+02, threshold=6.246e+02, percent-clipped=0.0 2023-06-24 11:56:07,525 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1094778.0, ans=0.125 2023-06-24 11:56:09,077 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_na.min_abs, batch_count=1094778.0, ans=0.02 2023-06-24 11:56:13,765 INFO [train.py:996] (0/4) Epoch 6, batch 30000, loss[loss=0.2045, simple_loss=0.2951, pruned_loss=0.05697, over 21616.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3071, pruned_loss=0.07805, over 4281334.93 frames. ], batch size: 230, lr: 4.89e-03, grad_scale: 32.0 2023-06-24 11:56:13,766 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-24 11:56:28,907 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.2.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([2.0818, 4.4705, 4.2276, 4.4763], device='cuda:0') 2023-06-24 11:56:34,161 INFO [train.py:1028] (0/4) Epoch 6, validation: loss=0.2459, simple_loss=0.3437, pruned_loss=0.07409, over 1796401.00 frames. 2023-06-24 11:56:34,162 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23616MB 2023-06-24 11:57:07,060 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1094898.0, ans=0.1 2023-06-24 11:57:51,777 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1095018.0, ans=0.0 2023-06-24 11:57:55,966 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1095018.0, ans=0.0 2023-06-24 11:58:03,209 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1095018.0, ans=0.125 2023-06-24 11:58:26,784 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.17 vs. limit=15.0 2023-06-24 11:58:36,161 INFO [train.py:996] (0/4) Epoch 6, batch 30050, loss[loss=0.2808, simple_loss=0.3913, pruned_loss=0.08515, over 21622.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3111, pruned_loss=0.07572, over 4273703.41 frames. ], batch size: 441, lr: 4.89e-03, grad_scale: 32.0 2023-06-24 11:59:07,508 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.77 vs. 
limit=15.0 2023-06-24 11:59:19,818 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1095198.0, ans=0.0 2023-06-24 11:59:52,674 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1095318.0, ans=0.025 2023-06-24 11:59:55,138 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.808e+02 2.460e+02 2.888e+02 3.811e+02 6.345e+02, threshold=5.776e+02, percent-clipped=1.0 2023-06-24 12:00:06,782 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1095378.0, ans=0.2 2023-06-24 12:00:24,971 INFO [train.py:996] (0/4) Epoch 6, batch 30100, loss[loss=0.194, simple_loss=0.262, pruned_loss=0.06301, over 21732.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3089, pruned_loss=0.07496, over 4271390.14 frames. ], batch size: 112, lr: 4.89e-03, grad_scale: 16.0 2023-06-24 12:01:36,347 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1095618.0, ans=0.04949747468305833 2023-06-24 12:02:15,862 INFO [train.py:996] (0/4) Epoch 6, batch 30150, loss[loss=0.1902, simple_loss=0.2393, pruned_loss=0.07056, over 20757.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3053, pruned_loss=0.07619, over 4271348.90 frames. ], batch size: 609, lr: 4.89e-03, grad_scale: 16.0 2023-06-24 12:02:42,915 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.63 vs. limit=10.0 2023-06-24 12:02:52,051 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1095798.0, ans=10.0 2023-06-24 12:02:54,004 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1095798.0, ans=0.1 2023-06-24 12:03:31,425 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.18 vs. limit=22.5 2023-06-24 12:03:41,167 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1095918.0, ans=0.1 2023-06-24 12:03:44,052 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.176e+02 2.662e+02 2.970e+02 3.572e+02 6.402e+02, threshold=5.941e+02, percent-clipped=1.0 2023-06-24 12:03:52,133 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1095978.0, ans=0.2 2023-06-24 12:04:19,441 INFO [train.py:996] (0/4) Epoch 6, batch 30200, loss[loss=0.2112, simple_loss=0.3101, pruned_loss=0.05613, over 21706.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3065, pruned_loss=0.07563, over 4264299.47 frames. ], batch size: 298, lr: 4.89e-03, grad_scale: 16.0 2023-06-24 12:06:10,623 INFO [train.py:996] (0/4) Epoch 6, batch 30250, loss[loss=0.2493, simple_loss=0.36, pruned_loss=0.06936, over 21627.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3143, pruned_loss=0.07787, over 4267554.15 frames. 
], batch size: 263, lr: 4.89e-03, grad_scale: 16.0 2023-06-24 12:06:40,958 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff2.min_abs, batch_count=1096398.0, ans=0.1 2023-06-24 12:06:44,569 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1096398.0, ans=0.0 2023-06-24 12:06:51,665 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1096398.0, ans=0.125 2023-06-24 12:06:56,748 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1096458.0, ans=0.2 2023-06-24 12:07:12,670 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.17 vs. limit=15.0 2023-06-24 12:07:13,985 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.78 vs. limit=15.0 2023-06-24 12:07:27,928 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.281e+02 2.716e+02 3.093e+02 3.619e+02 5.439e+02, threshold=6.186e+02, percent-clipped=0.0 2023-06-24 12:07:57,897 INFO [train.py:996] (0/4) Epoch 6, batch 30300, loss[loss=0.1951, simple_loss=0.2557, pruned_loss=0.06728, over 21270.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3124, pruned_loss=0.07822, over 4271938.26 frames. ], batch size: 159, lr: 4.88e-03, grad_scale: 16.0 2023-06-24 12:08:16,072 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1096638.0, ans=0.125 2023-06-24 12:08:16,814 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.95 vs. limit=15.0 2023-06-24 12:08:21,878 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.80 vs. limit=15.0 2023-06-24 12:08:25,011 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1096698.0, ans=0.125 2023-06-24 12:08:44,867 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1096758.0, ans=0.2 2023-06-24 12:09:54,015 INFO [train.py:996] (0/4) Epoch 6, batch 30350, loss[loss=0.2158, simple_loss=0.2564, pruned_loss=0.08757, over 20238.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3117, pruned_loss=0.07879, over 4262962.71 frames. ], batch size: 707, lr: 4.88e-03, grad_scale: 16.0 2023-06-24 12:10:18,876 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1096998.0, ans=0.1 2023-06-24 12:10:22,522 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.39 vs. limit=6.0 2023-06-24 12:10:25,607 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.94 vs. 
limit=15.0 2023-06-24 12:10:56,364 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.042e+02 2.694e+02 3.043e+02 3.524e+02 5.331e+02, threshold=6.085e+02, percent-clipped=0.0 2023-06-24 12:11:26,986 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1097238.0, ans=0.0 2023-06-24 12:11:27,899 INFO [train.py:996] (0/4) Epoch 6, batch 30400, loss[loss=0.2307, simple_loss=0.2714, pruned_loss=0.09503, over 20250.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3065, pruned_loss=0.07767, over 4247734.34 frames. ], batch size: 703, lr: 4.88e-03, grad_scale: 32.0 2023-06-24 12:11:34,891 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1097238.0, ans=0.0 2023-06-24 12:11:38,352 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.29 vs. limit=15.0 2023-06-24 12:11:49,092 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1097298.0, ans=0.1 2023-06-24 12:12:09,872 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 12:12:17,009 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.93 vs. limit=6.0 2023-06-24 12:12:17,904 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1097418.0, ans=0.2 2023-06-24 12:12:42,246 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1097478.0, ans=0.1 2023-06-24 12:12:57,187 INFO [train.py:996] (0/4) Epoch 6, batch 30450, loss[loss=0.2836, simple_loss=0.3968, pruned_loss=0.08525, over 19768.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3077, pruned_loss=0.07796, over 4191582.30 frames. ], batch size: 702, lr: 4.88e-03, grad_scale: 32.0 2023-06-24 12:13:47,209 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1097718.0, ans=0.025 2023-06-24 12:13:54,081 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1097718.0, ans=0.125 2023-06-24 12:13:56,613 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.978e+02 4.419e+02 5.663e+02 8.899e+02 2.204e+03, threshold=1.133e+03, percent-clipped=46.0 2023-06-24 12:14:01,251 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.80 vs. limit=15.0 2023-06-24 12:14:08,252 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/epoch-6.pt 2023-06-24 12:16:21,111 INFO [train.py:996] (0/4) Epoch 7, batch 0, loss[loss=0.2325, simple_loss=0.3039, pruned_loss=0.08056, over 21935.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3039, pruned_loss=0.08056, over 21935.00 frames. ], batch size: 113, lr: 4.48e-03, grad_scale: 32.0 2023-06-24 12:16:21,113 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-24 12:16:38,597 INFO [train.py:1028] (0/4) Epoch 7, validation: loss=0.2421, simple_loss=0.346, pruned_loss=0.0691, over 1796401.00 frames. 
2023-06-24 12:16:38,598 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23616MB 2023-06-24 12:17:01,585 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.27 vs. limit=15.0 2023-06-24 12:17:25,395 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1097862.0, ans=0.0 2023-06-24 12:18:05,014 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1097982.0, ans=0.0 2023-06-24 12:18:25,401 INFO [train.py:996] (0/4) Epoch 7, batch 50, loss[loss=0.2462, simple_loss=0.3469, pruned_loss=0.07275, over 21259.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3066, pruned_loss=0.07531, over 967825.49 frames. ], batch size: 143, lr: 4.48e-03, grad_scale: 32.0 2023-06-24 12:18:57,971 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1098162.0, ans=0.0 2023-06-24 12:19:23,059 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.49 vs. limit=6.0 2023-06-24 12:19:23,119 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.14 vs. limit=22.5 2023-06-24 12:19:29,061 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1098222.0, ans=0.125 2023-06-24 12:20:01,487 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.157e+02 2.689e+02 3.085e+02 3.734e+02 9.044e+02, threshold=6.169e+02, percent-clipped=0.0 2023-06-24 12:20:13,724 INFO [train.py:996] (0/4) Epoch 7, batch 100, loss[loss=0.2473, simple_loss=0.3324, pruned_loss=0.08108, over 21620.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3251, pruned_loss=0.07829, over 1707293.49 frames. ], batch size: 389, lr: 4.48e-03, grad_scale: 16.0 2023-06-24 12:20:51,996 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1098462.0, ans=0.2 2023-06-24 12:21:03,065 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.52 vs. limit=22.5 2023-06-24 12:21:54,416 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1098642.0, ans=0.1 2023-06-24 12:22:00,449 INFO [train.py:996] (0/4) Epoch 7, batch 150, loss[loss=0.2558, simple_loss=0.3503, pruned_loss=0.08067, over 21655.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3253, pruned_loss=0.07829, over 2278329.80 frames. 
], batch size: 414, lr: 4.48e-03, grad_scale: 16.0 2023-06-24 12:22:09,604 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1098702.0, ans=0.0 2023-06-24 12:22:14,364 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1098702.0, ans=0.0 2023-06-24 12:22:57,296 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1098822.0, ans=0.125 2023-06-24 12:23:36,475 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.136e+02 2.604e+02 2.896e+02 3.363e+02 6.379e+02, threshold=5.792e+02, percent-clipped=1.0 2023-06-24 12:23:40,516 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1098942.0, ans=0.125 2023-06-24 12:23:46,876 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1099002.0, ans=0.125 2023-06-24 12:23:47,921 INFO [train.py:996] (0/4) Epoch 7, batch 200, loss[loss=0.2462, simple_loss=0.3454, pruned_loss=0.07354, over 21687.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3256, pruned_loss=0.07874, over 2715932.04 frames. ], batch size: 414, lr: 4.48e-03, grad_scale: 16.0 2023-06-24 12:24:06,822 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.03 vs. limit=15.0 2023-06-24 12:24:16,883 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.99 vs. limit=6.0 2023-06-24 12:24:49,223 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1099122.0, ans=0.1 2023-06-24 12:25:24,946 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1099242.0, ans=0.125 2023-06-24 12:25:36,664 INFO [train.py:996] (0/4) Epoch 7, batch 250, loss[loss=0.2004, simple_loss=0.2891, pruned_loss=0.05587, over 21721.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3205, pruned_loss=0.07737, over 3057238.19 frames. ], batch size: 298, lr: 4.48e-03, grad_scale: 16.0 2023-06-24 12:26:30,976 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1099422.0, ans=0.025 2023-06-24 12:26:39,544 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=1099422.0, ans=0.05 2023-06-24 12:26:42,000 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=9.93 vs. limit=22.5 2023-06-24 12:27:14,904 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.025e+02 2.504e+02 2.848e+02 3.185e+02 4.478e+02, threshold=5.696e+02, percent-clipped=0.0 2023-06-24 12:27:27,344 INFO [train.py:996] (0/4) Epoch 7, batch 300, loss[loss=0.2318, simple_loss=0.2979, pruned_loss=0.0828, over 21446.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3142, pruned_loss=0.07721, over 3328996.86 frames. 
], batch size: 211, lr: 4.48e-03, grad_scale: 16.0 2023-06-24 12:27:49,947 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1099662.0, ans=0.2 2023-06-24 12:28:04,192 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1099662.0, ans=0.0 2023-06-24 12:28:17,046 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1099662.0, ans=0.0 2023-06-24 12:28:29,241 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1099722.0, ans=0.1 2023-06-24 12:29:18,770 INFO [train.py:996] (0/4) Epoch 7, batch 350, loss[loss=0.2335, simple_loss=0.2755, pruned_loss=0.09575, over 21439.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.309, pruned_loss=0.07711, over 3541039.61 frames. ], batch size: 511, lr: 4.48e-03, grad_scale: 16.0 2023-06-24 12:29:59,510 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1099962.0, ans=0.0 2023-06-24 12:30:00,163 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.28 vs. limit=15.0 2023-06-24 12:30:29,208 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1100082.0, ans=0.2 2023-06-24 12:30:43,344 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1100082.0, ans=0.0 2023-06-24 12:30:48,794 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1100082.0, ans=0.1 2023-06-24 12:30:58,784 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.904e+02 2.718e+02 3.112e+02 3.692e+02 6.265e+02, threshold=6.224e+02, percent-clipped=2.0 2023-06-24 12:31:11,304 INFO [train.py:996] (0/4) Epoch 7, batch 400, loss[loss=0.2368, simple_loss=0.3041, pruned_loss=0.08469, over 21613.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3039, pruned_loss=0.07588, over 3708438.17 frames. ], batch size: 391, lr: 4.48e-03, grad_scale: 32.0 2023-06-24 12:31:54,634 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1100262.0, ans=0.1 2023-06-24 12:32:30,781 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.53 vs. limit=15.0 2023-06-24 12:32:35,769 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1100382.0, ans=0.2 2023-06-24 12:33:02,128 INFO [train.py:996] (0/4) Epoch 7, batch 450, loss[loss=0.2219, simple_loss=0.2909, pruned_loss=0.07644, over 21964.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.3, pruned_loss=0.07391, over 3830982.10 frames. ], batch size: 316, lr: 4.48e-03, grad_scale: 16.0 2023-06-24 12:34:40,556 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.890e+02 2.615e+02 3.361e+02 4.061e+02 5.988e+02, threshold=6.722e+02, percent-clipped=0.0 2023-06-24 12:34:57,210 INFO [train.py:996] (0/4) Epoch 7, batch 500, loss[loss=0.2171, simple_loss=0.3012, pruned_loss=0.06644, over 21323.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.2984, pruned_loss=0.07317, over 3931048.27 frames. 
], batch size: 176, lr: 4.48e-03, grad_scale: 16.0 2023-06-24 12:36:46,113 INFO [train.py:996] (0/4) Epoch 7, batch 550, loss[loss=0.3002, simple_loss=0.3788, pruned_loss=0.1108, over 21742.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.2995, pruned_loss=0.07235, over 4012907.80 frames. ], batch size: 414, lr: 4.48e-03, grad_scale: 16.0 2023-06-24 12:36:55,719 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1101102.0, ans=0.125 2023-06-24 12:37:38,677 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1101222.0, ans=0.1 2023-06-24 12:37:59,290 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1101282.0, ans=0.1 2023-06-24 12:38:13,506 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1101342.0, ans=0.125 2023-06-24 12:38:14,377 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.022e+02 2.641e+02 3.136e+02 3.627e+02 5.437e+02, threshold=6.272e+02, percent-clipped=0.0 2023-06-24 12:38:28,513 INFO [train.py:996] (0/4) Epoch 7, batch 600, loss[loss=0.22, simple_loss=0.3132, pruned_loss=0.06339, over 21557.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.3022, pruned_loss=0.07197, over 4071040.69 frames. ], batch size: 230, lr: 4.48e-03, grad_scale: 8.0 2023-06-24 12:38:41,635 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1101402.0, ans=0.125 2023-06-24 12:39:02,828 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=1101462.0, ans=10.0 2023-06-24 12:39:23,165 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1101522.0, ans=0.0 2023-06-24 12:39:25,191 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 12:39:40,695 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1101582.0, ans=0.125 2023-06-24 12:39:58,328 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1101642.0, ans=0.125 2023-06-24 12:40:16,859 INFO [train.py:996] (0/4) Epoch 7, batch 650, loss[loss=0.2267, simple_loss=0.3167, pruned_loss=0.06836, over 21444.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.3047, pruned_loss=0.07276, over 4109526.26 frames. 
], batch size: 211, lr: 4.48e-03, grad_scale: 8.0 2023-06-24 12:40:35,262 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1101702.0, ans=0.125 2023-06-24 12:40:38,432 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1101762.0, ans=0.125 2023-06-24 12:41:10,228 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1101822.0, ans=0.125 2023-06-24 12:41:15,800 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1101822.0, ans=0.125 2023-06-24 12:41:26,490 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.32 vs. limit=10.0 2023-06-24 12:41:51,784 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.212e+02 2.716e+02 3.087e+02 3.645e+02 5.920e+02, threshold=6.175e+02, percent-clipped=0.0 2023-06-24 12:42:05,937 INFO [train.py:996] (0/4) Epoch 7, batch 700, loss[loss=0.1935, simple_loss=0.2577, pruned_loss=0.06462, over 21582.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.3057, pruned_loss=0.07363, over 4156626.19 frames. ], batch size: 247, lr: 4.48e-03, grad_scale: 8.0 2023-06-24 12:42:07,241 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.53 vs. limit=15.0 2023-06-24 12:43:19,781 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1102182.0, ans=0.1 2023-06-24 12:43:30,566 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1102242.0, ans=0.5 2023-06-24 12:43:53,585 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1102302.0, ans=0.0 2023-06-24 12:43:59,370 INFO [train.py:996] (0/4) Epoch 7, batch 750, loss[loss=0.2344, simple_loss=0.3092, pruned_loss=0.07977, over 21730.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3061, pruned_loss=0.07471, over 4190439.27 frames. ], batch size: 351, lr: 4.48e-03, grad_scale: 8.0 2023-06-24 12:44:09,045 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1102302.0, ans=0.0 2023-06-24 12:44:40,420 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1102362.0, ans=0.125 2023-06-24 12:44:55,808 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1102422.0, ans=0.2 2023-06-24 12:45:28,800 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.965e+02 2.942e+02 3.385e+02 4.235e+02 7.679e+02, threshold=6.771e+02, percent-clipped=3.0 2023-06-24 12:45:43,001 INFO [train.py:996] (0/4) Epoch 7, batch 800, loss[loss=0.1947, simple_loss=0.2645, pruned_loss=0.0625, over 21712.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3044, pruned_loss=0.07482, over 4211840.82 frames. ], batch size: 282, lr: 4.48e-03, grad_scale: 16.0 2023-06-24 12:45:54,504 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.46 vs. 
limit=15.0 2023-06-24 12:46:21,356 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.61 vs. limit=22.5 2023-06-24 12:46:30,300 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1102662.0, ans=0.125 2023-06-24 12:46:33,175 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1102722.0, ans=0.0 2023-06-24 12:46:33,217 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1102722.0, ans=0.2 2023-06-24 12:46:46,691 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.99 vs. limit=15.0 2023-06-24 12:47:01,862 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.38 vs. limit=12.0 2023-06-24 12:47:38,962 INFO [train.py:996] (0/4) Epoch 7, batch 850, loss[loss=0.2243, simple_loss=0.2966, pruned_loss=0.07602, over 21279.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3023, pruned_loss=0.07447, over 4231541.57 frames. ], batch size: 143, lr: 4.47e-03, grad_scale: 16.0 2023-06-24 12:48:16,469 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1102962.0, ans=0.0 2023-06-24 12:49:06,402 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1103142.0, ans=0.1 2023-06-24 12:49:07,528 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.999e+02 2.715e+02 3.192e+02 3.563e+02 7.547e+02, threshold=6.383e+02, percent-clipped=1.0 2023-06-24 12:49:27,455 INFO [train.py:996] (0/4) Epoch 7, batch 900, loss[loss=0.2257, simple_loss=0.303, pruned_loss=0.07419, over 20201.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.2986, pruned_loss=0.07391, over 4248167.14 frames. ], batch size: 703, lr: 4.47e-03, grad_scale: 16.0 2023-06-24 12:50:06,288 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1103262.0, ans=0.125 2023-06-24 12:50:08,820 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.55 vs. limit=22.5 2023-06-24 12:50:27,857 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1103322.0, ans=0.2 2023-06-24 12:51:17,589 INFO [train.py:996] (0/4) Epoch 7, batch 950, loss[loss=0.2201, simple_loss=0.2966, pruned_loss=0.07183, over 21444.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2963, pruned_loss=0.07383, over 4263546.69 frames. ], batch size: 194, lr: 4.47e-03, grad_scale: 16.0 2023-06-24 12:52:04,558 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1103622.0, ans=0.125 2023-06-24 12:52:09,685 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1103622.0, ans=0.125 2023-06-24 12:52:35,144 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.26 vs. 
limit=6.0 2023-06-24 12:52:59,201 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.008e+02 2.594e+02 2.897e+02 3.337e+02 7.292e+02, threshold=5.794e+02, percent-clipped=1.0 2023-06-24 12:53:06,775 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1103802.0, ans=0.125 2023-06-24 12:53:07,714 INFO [train.py:996] (0/4) Epoch 7, batch 1000, loss[loss=0.2025, simple_loss=0.2954, pruned_loss=0.05477, over 21690.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2952, pruned_loss=0.07327, over 4273507.05 frames. ], batch size: 263, lr: 4.47e-03, grad_scale: 16.0 2023-06-24 12:53:08,558 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 12:53:30,691 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.61 vs. limit=15.0 2023-06-24 12:53:56,900 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1103922.0, ans=0.1 2023-06-24 12:54:06,430 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=1103922.0, ans=15.0 2023-06-24 12:54:09,562 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1103922.0, ans=0.125 2023-06-24 12:54:15,845 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-184000.pt 2023-06-24 12:55:12,159 INFO [train.py:996] (0/4) Epoch 7, batch 1050, loss[loss=0.2249, simple_loss=0.2918, pruned_loss=0.07901, over 21812.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2925, pruned_loss=0.07236, over 4272148.37 frames. ], batch size: 351, lr: 4.47e-03, grad_scale: 16.0 2023-06-24 12:55:39,600 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1104162.0, ans=0.125 2023-06-24 12:55:47,841 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1104222.0, ans=0.0 2023-06-24 12:56:20,610 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1104282.0, ans=0.125 2023-06-24 12:56:42,630 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.988e+02 2.809e+02 3.239e+02 3.685e+02 6.477e+02, threshold=6.478e+02, percent-clipped=3.0 2023-06-24 12:56:57,497 INFO [train.py:996] (0/4) Epoch 7, batch 1100, loss[loss=0.2307, simple_loss=0.308, pruned_loss=0.07667, over 21779.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.292, pruned_loss=0.07234, over 4277586.91 frames. 
], batch size: 414, lr: 4.47e-03, grad_scale: 16.0 2023-06-24 12:57:12,149 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1104402.0, ans=0.125 2023-06-24 12:57:26,441 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1104462.0, ans=0.125 2023-06-24 12:57:57,093 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1104582.0, ans=0.2 2023-06-24 12:58:11,236 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1104582.0, ans=0.1 2023-06-24 12:58:20,922 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1104582.0, ans=0.125 2023-06-24 12:58:47,166 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1104702.0, ans=0.125 2023-06-24 12:58:48,092 INFO [train.py:996] (0/4) Epoch 7, batch 1150, loss[loss=0.227, simple_loss=0.2992, pruned_loss=0.07743, over 21478.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2933, pruned_loss=0.07289, over 4276846.78 frames. ], batch size: 548, lr: 4.47e-03, grad_scale: 16.0 2023-06-24 12:59:52,155 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 13:00:10,295 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1104882.0, ans=0.1 2023-06-24 13:00:30,293 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.093e+02 2.493e+02 2.841e+02 3.361e+02 6.236e+02, threshold=5.682e+02, percent-clipped=0.0 2023-06-24 13:00:30,952 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1104942.0, ans=0.0 2023-06-24 13:00:38,722 INFO [train.py:996] (0/4) Epoch 7, batch 1200, loss[loss=0.239, simple_loss=0.3131, pruned_loss=0.0824, over 21638.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2955, pruned_loss=0.0727, over 4279233.83 frames. ], batch size: 263, lr: 4.47e-03, grad_scale: 32.0 2023-06-24 13:00:40,012 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.20 vs. limit=22.5 2023-06-24 13:00:41,631 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1105002.0, ans=0.1 2023-06-24 13:00:46,311 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=1105002.0, ans=0.05 2023-06-24 13:00:49,595 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1105002.0, ans=0.125 2023-06-24 13:01:08,741 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1105062.0, ans=0.125 2023-06-24 13:01:59,559 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.56 vs. limit=22.5 2023-06-24 13:02:28,497 INFO [train.py:996] (0/4) Epoch 7, batch 1250, loss[loss=0.2118, simple_loss=0.2876, pruned_loss=0.06797, over 21510.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.2985, pruned_loss=0.07461, over 4278614.67 frames. 
], batch size: 131, lr: 4.47e-03, grad_scale: 32.0 2023-06-24 13:02:29,728 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.58 vs. limit=12.0 2023-06-24 13:03:15,611 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 13:04:06,512 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1105542.0, ans=0.1 2023-06-24 13:04:09,150 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.126e+02 2.694e+02 3.114e+02 3.849e+02 5.488e+02, threshold=6.227e+02, percent-clipped=0.0 2023-06-24 13:04:18,056 INFO [train.py:996] (0/4) Epoch 7, batch 1300, loss[loss=0.321, simple_loss=0.3995, pruned_loss=0.1213, over 21523.00 frames. ], tot_loss[loss=0.226, simple_loss=0.3016, pruned_loss=0.07524, over 4280485.80 frames. ], batch size: 507, lr: 4.47e-03, grad_scale: 32.0 2023-06-24 13:04:33,564 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.18 vs. limit=22.5 2023-06-24 13:05:16,945 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.71 vs. limit=10.0 2023-06-24 13:06:06,816 INFO [train.py:996] (0/4) Epoch 7, batch 1350, loss[loss=0.2087, simple_loss=0.281, pruned_loss=0.06824, over 21820.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.3017, pruned_loss=0.0757, over 4286688.03 frames. ], batch size: 124, lr: 4.47e-03, grad_scale: 32.0 2023-06-24 13:06:19,042 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=1105902.0, ans=15.0 2023-06-24 13:06:48,558 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1106022.0, ans=0.05 2023-06-24 13:07:02,214 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1106022.0, ans=0.04949747468305833 2023-06-24 13:07:18,916 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1106082.0, ans=0.025 2023-06-24 13:07:31,695 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1106082.0, ans=0.125 2023-06-24 13:07:48,037 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.180e+02 2.498e+02 2.809e+02 3.151e+02 4.941e+02, threshold=5.617e+02, percent-clipped=0.0 2023-06-24 13:07:56,339 INFO [train.py:996] (0/4) Epoch 7, batch 1400, loss[loss=0.2379, simple_loss=0.3105, pruned_loss=0.08259, over 21879.00 frames. ], tot_loss[loss=0.225, simple_loss=0.2998, pruned_loss=0.07511, over 4290391.06 frames. ], batch size: 332, lr: 4.47e-03, grad_scale: 32.0 2023-06-24 13:09:04,781 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1106382.0, ans=0.125 2023-06-24 13:09:17,464 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.10 vs. 
limit=15.0 2023-06-24 13:09:25,224 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1106382.0, ans=0.125 2023-06-24 13:09:41,939 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.28 vs. limit=15.0 2023-06-24 13:09:46,118 INFO [train.py:996] (0/4) Epoch 7, batch 1450, loss[loss=0.1946, simple_loss=0.2715, pruned_loss=0.05883, over 21816.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.3016, pruned_loss=0.07639, over 4285963.74 frames. ], batch size: 107, lr: 4.47e-03, grad_scale: 16.0 2023-06-24 13:10:00,819 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1106502.0, ans=0.125 2023-06-24 13:10:02,328 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1106562.0, ans=0.125 2023-06-24 13:10:05,909 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1106562.0, ans=0.125 2023-06-24 13:10:09,163 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1106562.0, ans=0.125 2023-06-24 13:10:13,403 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1106562.0, ans=0.1 2023-06-24 13:10:38,008 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1106622.0, ans=0.125 2023-06-24 13:10:38,038 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1106622.0, ans=0.1 2023-06-24 13:11:17,463 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1106742.0, ans=0.125 2023-06-24 13:11:28,956 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.014e+02 2.734e+02 3.228e+02 3.700e+02 6.613e+02, threshold=6.455e+02, percent-clipped=4.0 2023-06-24 13:11:36,321 INFO [train.py:996] (0/4) Epoch 7, batch 1500, loss[loss=0.2613, simple_loss=0.3391, pruned_loss=0.09175, over 21622.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3025, pruned_loss=0.07722, over 4289723.27 frames. ], batch size: 441, lr: 4.47e-03, grad_scale: 16.0 2023-06-24 13:11:36,980 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1106802.0, ans=0.2 2023-06-24 13:11:45,748 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1106802.0, ans=0.1 2023-06-24 13:11:47,147 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1106802.0, ans=0.125 2023-06-24 13:11:48,761 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1106802.0, ans=0.125 2023-06-24 13:12:55,348 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1106982.0, ans=0.0 2023-06-24 13:13:05,277 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.85 vs. 
limit=22.5 2023-06-24 13:13:24,210 INFO [train.py:996] (0/4) Epoch 7, batch 1550, loss[loss=0.2056, simple_loss=0.2654, pruned_loss=0.07291, over 20190.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3028, pruned_loss=0.0772, over 4286835.41 frames. ], batch size: 703, lr: 4.47e-03, grad_scale: 16.0 2023-06-24 13:13:56,582 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1107162.0, ans=0.125 2023-06-24 13:14:24,886 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1107222.0, ans=0.0 2023-06-24 13:14:38,866 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1107282.0, ans=0.0 2023-06-24 13:14:42,190 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1107282.0, ans=0.0 2023-06-24 13:15:01,736 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1107342.0, ans=0.1 2023-06-24 13:15:06,853 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.150e+02 2.619e+02 3.008e+02 3.656e+02 5.850e+02, threshold=6.017e+02, percent-clipped=0.0 2023-06-24 13:15:07,563 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1107342.0, ans=0.125 2023-06-24 13:15:13,449 INFO [train.py:996] (0/4) Epoch 7, batch 1600, loss[loss=0.2354, simple_loss=0.3106, pruned_loss=0.08014, over 21775.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3015, pruned_loss=0.0766, over 4288039.89 frames. ], batch size: 441, lr: 4.47e-03, grad_scale: 32.0 2023-06-24 13:15:16,409 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1107402.0, ans=0.015 2023-06-24 13:15:29,001 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1107402.0, ans=0.0 2023-06-24 13:15:32,904 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.68 vs. limit=6.0 2023-06-24 13:16:01,693 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1107462.0, ans=0.2 2023-06-24 13:16:26,101 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1107522.0, ans=0.125 2023-06-24 13:16:32,993 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1107582.0, ans=0.2 2023-06-24 13:16:55,545 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.23 vs. limit=15.0 2023-06-24 13:17:11,098 INFO [train.py:996] (0/4) Epoch 7, batch 1650, loss[loss=0.2502, simple_loss=0.3188, pruned_loss=0.09078, over 21337.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3011, pruned_loss=0.07563, over 4284930.17 frames. 
], batch size: 131, lr: 4.46e-03, grad_scale: 32.0 2023-06-24 13:18:02,656 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1107822.0, ans=0.04949747468305833 2023-06-24 13:18:19,097 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1107822.0, ans=0.125 2023-06-24 13:18:55,914 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.937e+02 2.769e+02 3.129e+02 3.705e+02 6.024e+02, threshold=6.259e+02, percent-clipped=1.0 2023-06-24 13:19:01,294 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.01 vs. limit=12.0 2023-06-24 13:19:03,724 INFO [train.py:996] (0/4) Epoch 7, batch 1700, loss[loss=0.2209, simple_loss=0.2935, pruned_loss=0.07414, over 20210.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3036, pruned_loss=0.0762, over 4280910.70 frames. ], batch size: 702, lr: 4.46e-03, grad_scale: 32.0 2023-06-24 13:19:32,700 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1108062.0, ans=0.0 2023-06-24 13:20:14,317 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1108182.0, ans=0.125 2023-06-24 13:21:02,720 INFO [train.py:996] (0/4) Epoch 7, batch 1750, loss[loss=0.2253, simple_loss=0.3555, pruned_loss=0.0476, over 19781.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3022, pruned_loss=0.07441, over 4280398.73 frames. ], batch size: 702, lr: 4.46e-03, grad_scale: 16.0 2023-06-24 13:21:49,071 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1108422.0, ans=0.015 2023-06-24 13:21:50,717 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1108422.0, ans=0.0 2023-06-24 13:22:50,229 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1108542.0, ans=0.125 2023-06-24 13:22:56,564 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.880e+02 2.765e+02 3.316e+02 4.331e+02 7.357e+02, threshold=6.632e+02, percent-clipped=3.0 2023-06-24 13:23:07,060 INFO [train.py:996] (0/4) Epoch 7, batch 1800, loss[loss=0.2457, simple_loss=0.346, pruned_loss=0.07275, over 21658.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.3005, pruned_loss=0.07285, over 4282146.18 frames. ], batch size: 414, lr: 4.46e-03, grad_scale: 16.0 2023-06-24 13:24:07,592 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1108782.0, ans=0.0 2023-06-24 13:24:09,517 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1108782.0, ans=0.0 2023-06-24 13:24:14,748 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1108782.0, ans=0.125 2023-06-24 13:24:52,501 INFO [train.py:996] (0/4) Epoch 7, batch 1850, loss[loss=0.2077, simple_loss=0.2847, pruned_loss=0.06534, over 21789.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.3009, pruned_loss=0.07139, over 4276597.07 frames. ], batch size: 247, lr: 4.46e-03, grad_scale: 8.0 2023-06-24 13:25:13,010 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.41 vs. 
limit=15.0 2023-06-24 13:25:31,811 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1109022.0, ans=0.0 2023-06-24 13:25:42,322 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1109022.0, ans=0.125 2023-06-24 13:25:51,443 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1109022.0, ans=0.125 2023-06-24 13:25:52,106 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.85 vs. limit=10.0 2023-06-24 13:25:59,275 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.08 vs. limit=15.0 2023-06-24 13:26:00,475 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1109082.0, ans=0.2 2023-06-24 13:26:38,098 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1109142.0, ans=0.125 2023-06-24 13:26:38,968 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.824e+02 2.879e+02 3.449e+02 4.316e+02 7.592e+02, threshold=6.898e+02, percent-clipped=3.0 2023-06-24 13:26:47,949 INFO [train.py:996] (0/4) Epoch 7, batch 1900, loss[loss=0.249, simple_loss=0.3143, pruned_loss=0.09182, over 21989.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.3036, pruned_loss=0.07159, over 4283119.31 frames. ], batch size: 103, lr: 4.46e-03, grad_scale: 8.0 2023-06-24 13:27:41,173 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1109322.0, ans=0.125 2023-06-24 13:28:38,153 INFO [train.py:996] (0/4) Epoch 7, batch 1950, loss[loss=0.2639, simple_loss=0.3533, pruned_loss=0.08726, over 21673.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.3001, pruned_loss=0.07207, over 4271212.81 frames. ], batch size: 441, lr: 4.46e-03, grad_scale: 8.0 2023-06-24 13:28:40,593 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1109502.0, ans=0.1 2023-06-24 13:29:01,865 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1109562.0, ans=0.2 2023-06-24 13:29:33,405 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.37 vs. limit=10.0 2023-06-24 13:29:50,205 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1109682.0, ans=0.0 2023-06-24 13:30:15,579 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.89 vs. limit=15.0 2023-06-24 13:30:26,491 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.057e+02 2.664e+02 3.137e+02 3.840e+02 6.499e+02, threshold=6.275e+02, percent-clipped=0.0 2023-06-24 13:30:29,936 INFO [train.py:996] (0/4) Epoch 7, batch 2000, loss[loss=0.2334, simple_loss=0.3095, pruned_loss=0.07865, over 21865.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2928, pruned_loss=0.06977, over 4259659.57 frames. 
], batch size: 351, lr: 4.46e-03, grad_scale: 16.0 2023-06-24 13:30:51,676 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1109862.0, ans=0.2 2023-06-24 13:31:22,459 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 13:31:27,777 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1109982.0, ans=0.125 2023-06-24 13:31:46,885 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1109982.0, ans=0.125 2023-06-24 13:32:00,823 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1110042.0, ans=0.125 2023-06-24 13:32:06,485 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.35 vs. limit=15.0 2023-06-24 13:32:20,850 INFO [train.py:996] (0/4) Epoch 7, batch 2050, loss[loss=0.2457, simple_loss=0.3295, pruned_loss=0.08093, over 21664.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2941, pruned_loss=0.06953, over 4266489.47 frames. ], batch size: 441, lr: 4.46e-03, grad_scale: 16.0 2023-06-24 13:32:42,203 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1110162.0, ans=0.125 2023-06-24 13:33:35,990 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1110282.0, ans=0.1 2023-06-24 13:33:52,659 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.94 vs. limit=15.0 2023-06-24 13:34:07,173 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.959e+02 2.697e+02 3.083e+02 3.787e+02 7.892e+02, threshold=6.165e+02, percent-clipped=1.0 2023-06-24 13:34:10,758 INFO [train.py:996] (0/4) Epoch 7, batch 2100, loss[loss=0.1858, simple_loss=0.2532, pruned_loss=0.05915, over 21741.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.298, pruned_loss=0.07156, over 4271957.32 frames. ], batch size: 351, lr: 4.46e-03, grad_scale: 16.0 2023-06-24 13:34:20,767 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.02 vs. limit=15.0 2023-06-24 13:34:30,155 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1110462.0, ans=0.07 2023-06-24 13:35:22,313 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1110582.0, ans=0.125 2023-06-24 13:35:23,793 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 13:36:02,122 INFO [train.py:996] (0/4) Epoch 7, batch 2150, loss[loss=0.2388, simple_loss=0.3241, pruned_loss=0.07678, over 21312.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.2986, pruned_loss=0.07293, over 4273354.92 frames. 
], batch size: 548, lr: 4.46e-03, grad_scale: 16.0 2023-06-24 13:36:14,216 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1110702.0, ans=0.1 2023-06-24 13:37:11,529 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1110882.0, ans=0.2 2023-06-24 13:37:17,291 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.89 vs. limit=15.0 2023-06-24 13:37:24,164 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1110882.0, ans=0.1 2023-06-24 13:37:24,706 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.05 vs. limit=12.0 2023-06-24 13:37:49,021 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.267e+02 2.806e+02 3.490e+02 4.529e+02 7.299e+02, threshold=6.981e+02, percent-clipped=4.0 2023-06-24 13:37:51,618 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1111002.0, ans=0.2 2023-06-24 13:37:52,599 INFO [train.py:996] (0/4) Epoch 7, batch 2200, loss[loss=0.2082, simple_loss=0.2926, pruned_loss=0.06194, over 21794.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.3023, pruned_loss=0.07356, over 4279041.91 frames. ], batch size: 282, lr: 4.46e-03, grad_scale: 16.0 2023-06-24 13:39:03,058 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1111182.0, ans=0.015 2023-06-24 13:39:06,972 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 13:39:07,011 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1111182.0, ans=0.125 2023-06-24 13:39:15,068 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1111182.0, ans=0.125 2023-06-24 13:39:40,180 INFO [train.py:996] (0/4) Epoch 7, batch 2250, loss[loss=0.1971, simple_loss=0.2635, pruned_loss=0.06536, over 21686.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.2995, pruned_loss=0.07213, over 4277917.88 frames. ], batch size: 282, lr: 4.46e-03, grad_scale: 16.0 2023-06-24 13:39:45,607 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1111302.0, ans=0.2 2023-06-24 13:40:37,023 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1111422.0, ans=0.1 2023-06-24 13:41:06,741 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1111542.0, ans=0.07 2023-06-24 13:41:24,939 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.962e+02 2.731e+02 3.125e+02 3.958e+02 6.138e+02, threshold=6.249e+02, percent-clipped=0.0 2023-06-24 13:41:28,576 INFO [train.py:996] (0/4) Epoch 7, batch 2300, loss[loss=0.1997, simple_loss=0.2686, pruned_loss=0.06533, over 21833.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2964, pruned_loss=0.0715, over 4281053.76 frames. 
], batch size: 107, lr: 4.46e-03, grad_scale: 16.0 2023-06-24 13:41:49,586 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1111662.0, ans=0.1 2023-06-24 13:42:08,437 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.92 vs. limit=15.0 2023-06-24 13:42:12,047 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1111722.0, ans=0.0 2023-06-24 13:43:17,669 INFO [train.py:996] (0/4) Epoch 7, batch 2350, loss[loss=0.2081, simple_loss=0.2804, pruned_loss=0.06792, over 21874.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2957, pruned_loss=0.07192, over 4279044.91 frames. ], batch size: 98, lr: 4.46e-03, grad_scale: 16.0 2023-06-24 13:43:18,307 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1111902.0, ans=0.2 2023-06-24 13:44:25,552 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1112022.0, ans=0.125 2023-06-24 13:45:05,380 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.026e+02 2.752e+02 3.198e+02 3.763e+02 6.793e+02, threshold=6.396e+02, percent-clipped=2.0 2023-06-24 13:45:08,864 INFO [train.py:996] (0/4) Epoch 7, batch 2400, loss[loss=0.2496, simple_loss=0.3241, pruned_loss=0.08755, over 21914.00 frames. ], tot_loss[loss=0.224, simple_loss=0.2999, pruned_loss=0.07407, over 4278685.13 frames. ], batch size: 372, lr: 4.46e-03, grad_scale: 32.0 2023-06-24 13:45:44,559 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.56 vs. limit=22.5 2023-06-24 13:46:23,932 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1112382.0, ans=0.125 2023-06-24 13:46:47,288 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1112442.0, ans=0.125 2023-06-24 13:46:54,126 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1112442.0, ans=0.05 2023-06-24 13:46:59,017 INFO [train.py:996] (0/4) Epoch 7, batch 2450, loss[loss=0.2592, simple_loss=0.3357, pruned_loss=0.09136, over 21578.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.3032, pruned_loss=0.07597, over 4279004.83 frames. ], batch size: 389, lr: 4.46e-03, grad_scale: 32.0 2023-06-24 13:47:39,920 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1112562.0, ans=0.1 2023-06-24 13:47:49,128 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1112622.0, ans=0.2 2023-06-24 13:48:03,991 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.84 vs. limit=15.0 2023-06-24 13:48:09,184 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.61 vs. 
limit=10.0 2023-06-24 13:48:48,359 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.212e+02 2.774e+02 3.648e+02 4.607e+02 7.858e+02, threshold=7.296e+02, percent-clipped=5.0 2023-06-24 13:48:51,812 INFO [train.py:996] (0/4) Epoch 7, batch 2500, loss[loss=0.2172, simple_loss=0.3003, pruned_loss=0.06712, over 21158.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.2999, pruned_loss=0.0746, over 4284565.46 frames. ], batch size: 548, lr: 4.45e-03, grad_scale: 32.0 2023-06-24 13:50:16,585 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1112982.0, ans=0.2 2023-06-24 13:50:28,743 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1113042.0, ans=0.125 2023-06-24 13:50:31,951 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1113042.0, ans=0.0 2023-06-24 13:50:42,239 INFO [train.py:996] (0/4) Epoch 7, batch 2550, loss[loss=0.2124, simple_loss=0.2801, pruned_loss=0.07229, over 21774.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2974, pruned_loss=0.07314, over 4282879.67 frames. ], batch size: 351, lr: 4.45e-03, grad_scale: 32.0 2023-06-24 13:52:18,205 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1113342.0, ans=0.1 2023-06-24 13:52:30,406 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.150e+02 2.860e+02 3.358e+02 4.176e+02 6.278e+02, threshold=6.716e+02, percent-clipped=0.0 2023-06-24 13:52:32,026 INFO [train.py:996] (0/4) Epoch 7, batch 2600, loss[loss=0.2097, simple_loss=0.3009, pruned_loss=0.0592, over 21297.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.3006, pruned_loss=0.07493, over 4285375.40 frames. ], batch size: 176, lr: 4.45e-03, grad_scale: 16.0 2023-06-24 13:52:32,868 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1113402.0, ans=0.2 2023-06-24 13:52:37,537 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.94 vs. limit=15.0 2023-06-24 13:52:47,003 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1113402.0, ans=0.125 2023-06-24 13:54:18,901 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.65 vs. limit=22.5 2023-06-24 13:54:23,119 INFO [train.py:996] (0/4) Epoch 7, batch 2650, loss[loss=0.2198, simple_loss=0.2904, pruned_loss=0.07459, over 21852.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3019, pruned_loss=0.0767, over 4285451.49 frames. 
], batch size: 414, lr: 4.45e-03, grad_scale: 16.0 2023-06-24 13:54:52,607 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1113762.0, ans=0.125 2023-06-24 13:54:52,741 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1113762.0, ans=0.07 2023-06-24 13:55:25,917 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1113822.0, ans=0.0 2023-06-24 13:56:12,128 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.196e+02 2.624e+02 3.107e+02 3.655e+02 6.528e+02, threshold=6.215e+02, percent-clipped=0.0 2023-06-24 13:56:14,299 INFO [train.py:996] (0/4) Epoch 7, batch 2700, loss[loss=0.176, simple_loss=0.2452, pruned_loss=0.05346, over 21270.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.2996, pruned_loss=0.07655, over 4277453.86 frames. ], batch size: 176, lr: 4.45e-03, grad_scale: 16.0 2023-06-24 13:56:16,499 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1114002.0, ans=0.1 2023-06-24 13:56:27,588 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.29 vs. limit=15.0 2023-06-24 13:56:44,849 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1114062.0, ans=0.015 2023-06-24 13:56:48,705 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1114062.0, ans=0.05 2023-06-24 13:56:59,699 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1114122.0, ans=0.2 2023-06-24 13:57:30,065 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1114182.0, ans=0.1 2023-06-24 13:57:33,619 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1114182.0, ans=0.0 2023-06-24 13:58:04,723 INFO [train.py:996] (0/4) Epoch 7, batch 2750, loss[loss=0.2474, simple_loss=0.3261, pruned_loss=0.0844, over 21872.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.2968, pruned_loss=0.07582, over 4280279.48 frames. ], batch size: 107, lr: 4.45e-03, grad_scale: 16.0 2023-06-24 13:58:27,675 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1114362.0, ans=0.2 2023-06-24 13:59:37,017 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1114482.0, ans=0.035 2023-06-24 13:59:37,739 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.15 vs. limit=22.5 2023-06-24 13:59:43,981 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1114542.0, ans=0.0 2023-06-24 14:00:01,271 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.337e+02 2.964e+02 3.229e+02 3.808e+02 6.340e+02, threshold=6.458e+02, percent-clipped=1.0 2023-06-24 14:00:03,077 INFO [train.py:996] (0/4) Epoch 7, batch 2800, loss[loss=0.218, simple_loss=0.295, pruned_loss=0.0705, over 21077.00 frames. 
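The [optim.py:471] entries in this log report grad-norm quartiles (min / 25% / 50% / 75% / max), a clipping threshold, and a percent-clipped figure; in each entry the threshold is, up to rounding, twice the reported median (e.g. 3.107e+02 vs. threshold 6.215e+02), consistent with Clipping_scale=2.0. The sketch below is an illustrative stand-in for this kind of statistics-driven clipping, not the actual icefall optimizer code; `clip_by_quartile` and its `norm_history` argument are made-up names for illustration.

```python
import torch

def clip_by_quartile(params, norm_history, clipping_scale=2.0, window=128):
    """Clip gradients with a threshold tied to recent grad-norm statistics.

    norm_history is a plain list of recent total grad norms (an assumption;
    a real optimizer would keep its own running statistics internally).
    """
    grads = [p.grad for p in params if p.grad is not None]
    total_norm = torch.norm(torch.stack([g.detach().norm() for g in grads]))
    norm_history.append(total_norm.item())
    del norm_history[:-window]                      # keep only the last `window` norms

    sorted_norms = sorted(norm_history)
    n = len(sorted_norms) - 1
    quartiles = [sorted_norms[int(round(q * n))] for q in (0.0, 0.25, 0.5, 0.75, 1.0)]
    threshold = clipping_scale * quartiles[2]       # 2.0 x median, as in the log entries
    clipped = total_norm.item() > threshold
    if clipped:
        for g in grads:
            g.mul_(threshold / total_norm.item())   # scale gradients down to the threshold
    return quartiles, threshold, clipped
```

The percent-clipped figure would then simply be the fraction of batches within the reporting interval for which `clipped` came back True.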
], tot_loss[loss=0.227, simple_loss=0.3006, pruned_loss=0.07672, over 4281931.63 frames. ], batch size: 607, lr: 4.45e-03, grad_scale: 32.0 2023-06-24 14:00:07,839 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1114602.0, ans=0.125 2023-06-24 14:00:23,474 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1114602.0, ans=0.125 2023-06-24 14:00:25,908 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1114662.0, ans=0.125 2023-06-24 14:00:44,068 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.53 vs. limit=22.5 2023-06-24 14:00:49,335 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1114722.0, ans=0.125 2023-06-24 14:00:58,098 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1114722.0, ans=0.125 2023-06-24 14:01:36,809 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1114842.0, ans=0.0 2023-06-24 14:01:54,113 INFO [train.py:996] (0/4) Epoch 7, batch 2850, loss[loss=0.2623, simple_loss=0.3263, pruned_loss=0.09912, over 21361.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3022, pruned_loss=0.07736, over 4276137.00 frames. ], batch size: 549, lr: 4.45e-03, grad_scale: 32.0 2023-06-24 14:02:10,501 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1114902.0, ans=0.2 2023-06-24 14:02:47,757 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1115022.0, ans=0.0 2023-06-24 14:03:13,366 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1115082.0, ans=0.0 2023-06-24 14:03:20,719 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1115142.0, ans=0.125 2023-06-24 14:03:42,919 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.437e+02 2.854e+02 3.316e+02 3.985e+02 8.556e+02, threshold=6.632e+02, percent-clipped=4.0 2023-06-24 14:03:42,954 INFO [train.py:996] (0/4) Epoch 7, batch 2900, loss[loss=0.2018, simple_loss=0.2755, pruned_loss=0.06407, over 21830.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3014, pruned_loss=0.07792, over 4277429.43 frames. 
], batch size: 282, lr: 4.45e-03, grad_scale: 16.0 2023-06-24 14:03:48,514 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1115202.0, ans=0.125 2023-06-24 14:04:02,933 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1115202.0, ans=0.125 2023-06-24 14:04:06,304 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1115262.0, ans=0.2 2023-06-24 14:04:12,083 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1115262.0, ans=0.125 2023-06-24 14:04:24,293 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1115262.0, ans=0.0 2023-06-24 14:04:24,382 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1115262.0, ans=0.0 2023-06-24 14:05:18,587 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1115442.0, ans=0.125 2023-06-24 14:05:20,215 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 14:05:33,563 INFO [train.py:996] (0/4) Epoch 7, batch 2950, loss[loss=0.2261, simple_loss=0.291, pruned_loss=0.08058, over 21870.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3019, pruned_loss=0.07785, over 4286359.88 frames. ], batch size: 298, lr: 4.45e-03, grad_scale: 16.0 2023-06-24 14:05:51,250 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.16 vs. limit=6.0 2023-06-24 14:06:49,644 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1115682.0, ans=0.2 2023-06-24 14:06:54,199 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.16 vs. limit=12.0 2023-06-24 14:07:24,637 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.073e+02 2.857e+02 3.209e+02 3.929e+02 8.381e+02, threshold=6.419e+02, percent-clipped=2.0 2023-06-24 14:07:24,668 INFO [train.py:996] (0/4) Epoch 7, batch 3000, loss[loss=0.2664, simple_loss=0.3399, pruned_loss=0.09644, over 21423.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3057, pruned_loss=0.07734, over 4287653.03 frames. ], batch size: 159, lr: 4.45e-03, grad_scale: 16.0 2023-06-24 14:07:24,669 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-24 14:07:43,765 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.3.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([2.2250, 2.9242, 2.9575, 3.2999, 2.7439, 2.8178, 3.3290, 3.3032], device='cuda:0') 2023-06-24 14:07:46,554 INFO [train.py:1028] (0/4) Epoch 7, validation: loss=0.2481, simple_loss=0.3407, pruned_loss=0.0778, over 1796401.00 frames. 2023-06-24 14:07:46,555 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23616MB 2023-06-24 14:08:00,333 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1115802.0, ans=0.125 2023-06-24 14:09:37,557 INFO [train.py:996] (0/4) Epoch 7, batch 3050, loss[loss=0.2081, simple_loss=0.2934, pruned_loss=0.06139, over 21833.00 frames. 
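The [train.py:1019/1028/1029] entries above show the periodic validation pass: training pauses, the loss is averaged over the dev set (the "over 1796401.00 frames" figure), and peak CUDA memory is reported. Below is a minimal sketch of such a pass, assuming a model interface that returns a scalar loss and a frame count per batch; the real training script also tracks the per-component losses and the best validation loss so far.

```python
import torch

@torch.no_grad()
def compute_validation_loss(model, valid_dl, device):
    """Average the loss over the dev loader and report peak CUDA memory in MB."""
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    for batch in valid_dl:
        loss, num_frames = model(batch)   # assumed interface: scalar loss + frame count
        tot_loss += loss.item() * num_frames
        tot_frames += num_frames
    model.train()
    max_mem_mb = 0
    if torch.cuda.is_available():
        max_mem_mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
    return tot_loss / tot_frames, max_mem_mb
```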
], tot_loss[loss=0.2292, simple_loss=0.3063, pruned_loss=0.07606, over 4287186.48 frames. ], batch size: 371, lr: 4.45e-03, grad_scale: 16.0 2023-06-24 14:09:52,061 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1116102.0, ans=0.1 2023-06-24 14:10:46,018 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1116282.0, ans=0.0 2023-06-24 14:10:53,721 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.97 vs. limit=15.0 2023-06-24 14:11:33,742 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.863e+02 2.533e+02 2.915e+02 3.819e+02 6.639e+02, threshold=5.830e+02, percent-clipped=1.0 2023-06-24 14:11:33,773 INFO [train.py:996] (0/4) Epoch 7, batch 3100, loss[loss=0.1975, simple_loss=0.275, pruned_loss=0.06006, over 21487.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.3055, pruned_loss=0.07486, over 4295416.36 frames. ], batch size: 211, lr: 4.45e-03, grad_scale: 16.0 2023-06-24 14:11:45,561 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.61 vs. limit=12.0 2023-06-24 14:12:03,299 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.06 vs. limit=6.0 2023-06-24 14:12:12,931 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1116522.0, ans=0.125 2023-06-24 14:12:47,800 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1116582.0, ans=0.125 2023-06-24 14:13:06,484 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1116642.0, ans=0.125 2023-06-24 14:13:12,082 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1116642.0, ans=0.125 2023-06-24 14:13:13,917 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1116642.0, ans=0.07 2023-06-24 14:13:25,692 INFO [train.py:996] (0/4) Epoch 7, batch 3150, loss[loss=0.2588, simple_loss=0.3324, pruned_loss=0.09262, over 21704.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3077, pruned_loss=0.07636, over 4295223.16 frames. ], batch size: 351, lr: 4.45e-03, grad_scale: 16.0 2023-06-24 14:13:26,502 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1116702.0, ans=0.0 2023-06-24 14:13:37,447 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1116702.0, ans=0.1 2023-06-24 14:13:44,248 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1116702.0, ans=0.125 2023-06-24 14:14:13,355 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.06 vs. 
limit=12.0 2023-06-24 14:14:22,815 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1116822.0, ans=0.125 2023-06-24 14:14:31,829 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.51 vs. limit=15.0 2023-06-24 14:14:36,207 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1116882.0, ans=0.125 2023-06-24 14:14:37,720 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 14:15:22,167 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.040e+02 2.712e+02 3.098e+02 3.534e+02 5.991e+02, threshold=6.196e+02, percent-clipped=1.0 2023-06-24 14:15:22,204 INFO [train.py:996] (0/4) Epoch 7, batch 3200, loss[loss=0.2096, simple_loss=0.2938, pruned_loss=0.06266, over 21788.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3089, pruned_loss=0.07601, over 4294422.95 frames. ], batch size: 282, lr: 4.45e-03, grad_scale: 32.0 2023-06-24 14:16:02,865 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1117122.0, ans=0.04949747468305833 2023-06-24 14:16:58,233 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1117242.0, ans=0.0 2023-06-24 14:17:12,253 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1117302.0, ans=0.125 2023-06-24 14:17:13,208 INFO [train.py:996] (0/4) Epoch 7, batch 3250, loss[loss=0.2827, simple_loss=0.3189, pruned_loss=0.1232, over 21451.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3104, pruned_loss=0.0781, over 4292445.33 frames. ], batch size: 510, lr: 4.45e-03, grad_scale: 16.0 2023-06-24 14:17:38,032 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.45 vs. limit=22.5 2023-06-24 14:18:13,686 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 14:18:42,970 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1117482.0, ans=0.125 2023-06-24 14:18:44,780 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1117482.0, ans=0.0 2023-06-24 14:18:47,171 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 14:19:01,265 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1117542.0, ans=0.125 2023-06-24 14:19:05,839 INFO [train.py:996] (0/4) Epoch 7, batch 3300, loss[loss=0.2123, simple_loss=0.3055, pruned_loss=0.05952, over 21731.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3043, pruned_loss=0.07726, over 4286222.66 frames. 
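The [scaling.py:962] entries compare a per-module whitening metric against a limit (e.g. metric=4.51 vs. limit=15.0). One plausible diagnostic of this kind, sketched below, measures how spread out the eigenvalues of the per-group channel covariance are: it equals 1 for perfectly whitened (isotropic) activations and grows as a few directions dominate. Treat this as an illustrative formula and `whitening_metric` as a hypothetical name, not the exact scaling.py computation.

```python
import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
    """x: (num_frames, num_channels). Returns mean-of-squared-eigenvalues over
    squared-mean-eigenvalue of the channel covariance, averaged over groups."""
    num_frames, num_channels = x.shape
    assert num_channels % num_groups == 0
    x = x.reshape(num_frames, num_groups, num_channels // num_groups).transpose(0, 1)
    x = x - x.mean(dim=1, keepdim=True)                      # center per group
    cov = torch.matmul(x.transpose(1, 2), x) / num_frames    # (groups, c, c)
    eigs = torch.linalg.eigvalsh(cov)                        # covariance is symmetric
    metric = (eigs.pow(2).mean(dim=1) / eigs.mean(dim=1).pow(2)).mean()
    return metric.item()

x = torch.randn(1000, 256)            # roughly white input: metric stays far below the limit
print(whitening_metric(x), "vs. limit=22.5")
```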
], batch size: 282, lr: 4.45e-03, grad_scale: 16.0 2023-06-24 14:19:07,519 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.143e+02 2.718e+02 3.384e+02 4.609e+02 8.476e+02, threshold=6.767e+02, percent-clipped=13.0 2023-06-24 14:19:17,821 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.00 vs. limit=15.0 2023-06-24 14:20:44,757 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1117842.0, ans=0.2 2023-06-24 14:20:46,405 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1117842.0, ans=0.05 2023-06-24 14:20:56,361 INFO [train.py:996] (0/4) Epoch 7, batch 3350, loss[loss=0.2306, simple_loss=0.3039, pruned_loss=0.07869, over 21943.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3039, pruned_loss=0.07657, over 4278695.53 frames. ], batch size: 316, lr: 4.44e-03, grad_scale: 16.0 2023-06-24 14:22:27,785 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1118082.0, ans=0.1 2023-06-24 14:22:53,123 INFO [train.py:996] (0/4) Epoch 7, batch 3400, loss[loss=0.2006, simple_loss=0.2807, pruned_loss=0.06028, over 21611.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3036, pruned_loss=0.07713, over 4280952.53 frames. ], batch size: 247, lr: 4.44e-03, grad_scale: 16.0 2023-06-24 14:22:54,754 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.113e+02 2.810e+02 3.179e+02 3.983e+02 5.568e+02, threshold=6.357e+02, percent-clipped=0.0 2023-06-24 14:24:01,138 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1118382.0, ans=0.1 2023-06-24 14:24:31,460 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1118442.0, ans=0.125 2023-06-24 14:24:38,640 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1118442.0, ans=0.0 2023-06-24 14:24:43,521 INFO [train.py:996] (0/4) Epoch 7, batch 3450, loss[loss=0.2298, simple_loss=0.2956, pruned_loss=0.08198, over 21190.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.3001, pruned_loss=0.07687, over 4276452.93 frames. 
], batch size: 143, lr: 4.44e-03, grad_scale: 16.0 2023-06-24 14:24:44,349 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1118502.0, ans=0.1 2023-06-24 14:25:39,102 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 14:25:48,314 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1118622.0, ans=0.1 2023-06-24 14:26:23,168 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1118742.0, ans=0.125 2023-06-24 14:26:26,269 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1118742.0, ans=0.125 2023-06-24 14:26:32,433 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 14:26:36,949 INFO [train.py:996] (0/4) Epoch 7, batch 3500, loss[loss=0.2725, simple_loss=0.3668, pruned_loss=0.08911, over 21588.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3085, pruned_loss=0.07941, over 4273576.75 frames. ], batch size: 263, lr: 4.44e-03, grad_scale: 16.0 2023-06-24 14:26:37,622 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1118802.0, ans=0.0 2023-06-24 14:26:38,660 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.230e+02 2.694e+02 2.966e+02 3.710e+02 5.580e+02, threshold=5.932e+02, percent-clipped=0.0 2023-06-24 14:27:06,791 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.87 vs. limit=15.0 2023-06-24 14:27:33,641 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1118922.0, ans=0.1 2023-06-24 14:27:37,113 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1118922.0, ans=0.0 2023-06-24 14:28:33,478 INFO [train.py:996] (0/4) Epoch 7, batch 3550, loss[loss=0.2113, simple_loss=0.277, pruned_loss=0.07274, over 21874.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3109, pruned_loss=0.08013, over 4268835.48 frames. ], batch size: 372, lr: 4.44e-03, grad_scale: 16.0 2023-06-24 14:28:35,857 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1119102.0, ans=0.125 2023-06-24 14:28:47,980 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1119102.0, ans=0.125 2023-06-24 14:29:16,707 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.34 vs. limit=15.0 2023-06-24 14:30:24,486 INFO [train.py:996] (0/4) Epoch 7, batch 3600, loss[loss=0.2237, simple_loss=0.3228, pruned_loss=0.06225, over 20667.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3056, pruned_loss=0.07909, over 4267647.44 frames. 
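The per-batch loss[...] figures above describe a single batch, while tot_loss[... over N frames] aggregates recent batches; the frame counts hover around 4.27-4.29M instead of growing without bound, which suggests a decayed, frame-weighted running sum over roughly the last few hundred batches. The sketch below illustrates that behaviour using loss/frame values taken from nearby entries; the `LossTracker` class and the decay factor are assumptions for illustration only.

```python
class LossTracker(dict):
    """Holds frame-weighted sums, e.g. {'frames': F, 'loss': sum(loss_i * frames_i)}."""

    def scaled(self, factor):
        out = LossTracker(self)
        for k in out:
            out[k] *= factor
        return out

    def plus(self, other):
        out = LossTracker(self)
        for k, v in other.items():
            out[k] = out.get(k, 0.0) + v
        return out

    def averages(self):
        frames = self["frames"]
        return {k: v / frames for k, v in self.items() if k != "frames"}

tot = LossTracker()
decay = 1.0 - 1.0 / 200              # assumed effective window of ~200 batches
for batch_loss, frames in [(0.2306, 21943.0), (0.2006, 21611.0), (0.2298, 21190.0)]:
    info = LossTracker(frames=frames, loss=batch_loss * frames)
    print("loss:", info.averages())                          # the per-batch figure
    tot = tot.scaled(decay).plus(info) if tot else info      # the running aggregate
print("tot_loss:", tot.averages(), "over", tot["frames"], "frames")
```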
], batch size: 607, lr: 4.44e-03, grad_scale: 32.0 2023-06-24 14:30:31,400 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.346e+02 2.873e+02 3.282e+02 3.993e+02 6.971e+02, threshold=6.565e+02, percent-clipped=2.0 2023-06-24 14:31:10,098 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1119522.0, ans=0.2 2023-06-24 14:32:22,677 INFO [train.py:996] (0/4) Epoch 7, batch 3650, loss[loss=0.308, simple_loss=0.4199, pruned_loss=0.09802, over 19934.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3081, pruned_loss=0.07979, over 4269880.42 frames. ], batch size: 702, lr: 4.44e-03, grad_scale: 32.0 2023-06-24 14:32:31,926 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1119702.0, ans=0.0 2023-06-24 14:32:54,546 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1119762.0, ans=0.125 2023-06-24 14:32:54,689 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1119762.0, ans=0.2 2023-06-24 14:33:08,608 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1119822.0, ans=0.0 2023-06-24 14:33:45,863 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1119942.0, ans=0.125 2023-06-24 14:34:06,797 INFO [train.py:996] (0/4) Epoch 7, batch 3700, loss[loss=0.2465, simple_loss=0.3219, pruned_loss=0.0855, over 21839.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3072, pruned_loss=0.07863, over 4275960.66 frames. ], batch size: 414, lr: 4.44e-03, grad_scale: 32.0 2023-06-24 14:34:08,354 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.135e+02 2.822e+02 3.276e+02 3.785e+02 7.589e+02, threshold=6.551e+02, percent-clipped=1.0 2023-06-24 14:34:56,896 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1120122.0, ans=0.035 2023-06-24 14:36:02,189 INFO [train.py:996] (0/4) Epoch 7, batch 3750, loss[loss=0.1388, simple_loss=0.1803, pruned_loss=0.04867, over 16872.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3056, pruned_loss=0.07769, over 4272690.68 frames. ], batch size: 60, lr: 4.44e-03, grad_scale: 32.0 2023-06-24 14:36:32,685 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1120362.0, ans=0.0 2023-06-24 14:37:02,498 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.36 vs. limit=15.0 2023-06-24 14:37:39,477 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1120542.0, ans=0.125 2023-06-24 14:37:57,855 INFO [train.py:996] (0/4) Epoch 7, batch 3800, loss[loss=0.2348, simple_loss=0.3058, pruned_loss=0.08186, over 21375.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3031, pruned_loss=0.07615, over 4282664.89 frames. 
], batch size: 176, lr: 4.44e-03, grad_scale: 16.0 2023-06-24 14:38:01,823 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.853e+02 2.713e+02 3.064e+02 3.470e+02 5.470e+02, threshold=6.128e+02, percent-clipped=0.0 2023-06-24 14:38:14,603 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=1120662.0, ans=0.05 2023-06-24 14:39:49,643 INFO [train.py:996] (0/4) Epoch 7, batch 3850, loss[loss=0.2509, simple_loss=0.2826, pruned_loss=0.1097, over 21403.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3016, pruned_loss=0.07687, over 4281221.74 frames. ], batch size: 509, lr: 4.44e-03, grad_scale: 16.0 2023-06-24 14:40:03,579 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1120902.0, ans=0.0 2023-06-24 14:40:09,107 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.00 vs. limit=12.0 2023-06-24 14:41:33,216 INFO [train.py:996] (0/4) Epoch 7, batch 3900, loss[loss=0.3251, simple_loss=0.4342, pruned_loss=0.108, over 21225.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.2996, pruned_loss=0.0769, over 4276755.39 frames. ], batch size: 549, lr: 4.44e-03, grad_scale: 16.0 2023-06-24 14:41:36,449 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.175e+02 2.710e+02 3.145e+02 3.584e+02 6.226e+02, threshold=6.291e+02, percent-clipped=1.0 2023-06-24 14:42:21,019 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1121322.0, ans=0.0 2023-06-24 14:42:26,660 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1121322.0, ans=0.125 2023-06-24 14:43:10,660 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1121442.0, ans=0.125 2023-06-24 14:43:24,057 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.39 vs. limit=12.0 2023-06-24 14:43:27,032 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1121442.0, ans=0.1 2023-06-24 14:43:31,392 INFO [train.py:996] (0/4) Epoch 7, batch 3950, loss[loss=0.2168, simple_loss=0.2868, pruned_loss=0.07341, over 20149.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3, pruned_loss=0.07591, over 4267732.43 frames. ], batch size: 703, lr: 4.44e-03, grad_scale: 16.0 2023-06-24 14:43:50,945 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.19 vs. limit=15.0 2023-06-24 14:44:42,398 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1121682.0, ans=0.125 2023-06-24 14:44:42,441 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1121682.0, ans=0.125 2023-06-24 14:45:22,790 INFO [train.py:996] (0/4) Epoch 7, batch 4000, loss[loss=0.1933, simple_loss=0.2609, pruned_loss=0.06288, over 21522.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2939, pruned_loss=0.07288, over 4265585.72 frames. 
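The learning rate logged with each batch drifts slowly downward across this stretch (4.46e-03 near batch 2350, 4.42e-03 by batch 5200), i.e. it decays as a smooth function of both batch index and epoch. The function below is a generic inverse-quarter-power decay in the spirit of the Eden-style schedulers used with these recipes; the exponents and the reference constants in the call are placeholders, not the run's actual settings.

```python
def lr_at(base_lr: float, batch: float, epoch: float,
          lr_batches: float, lr_epochs: float) -> float:
    # Each factor is close to 1 early on and falls off slowly as batch/epoch grow.
    batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * batch_factor * epoch_factor

# Illustrative call with placeholder constants: the result shrinks very gently
# with batch count, mirroring the 4.46e-03 -> 4.42e-03 drift in the entries above.
print(lr_at(0.05, batch=1.12e6, epoch=7.0, lr_batches=5000.0, lr_epochs=3.0))
```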
], batch size: 195, lr: 4.44e-03, grad_scale: 32.0 2023-06-24 14:45:26,597 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.070e+02 2.561e+02 2.888e+02 3.482e+02 6.063e+02, threshold=5.775e+02, percent-clipped=0.0 2023-06-24 14:45:34,756 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.61 vs. limit=6.0 2023-06-24 14:45:50,992 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1121862.0, ans=0.1 2023-06-24 14:46:20,316 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1121922.0, ans=0.125 2023-06-24 14:47:13,459 INFO [train.py:996] (0/4) Epoch 7, batch 4050, loss[loss=0.2247, simple_loss=0.3059, pruned_loss=0.07176, over 21719.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2938, pruned_loss=0.07182, over 4267067.86 frames. ], batch size: 389, lr: 4.44e-03, grad_scale: 32.0 2023-06-24 14:47:55,010 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.78 vs. limit=15.0 2023-06-24 14:48:31,610 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 14:48:34,739 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1122282.0, ans=0.0 2023-06-24 14:48:41,071 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.43 vs. limit=22.5 2023-06-24 14:49:04,258 INFO [train.py:996] (0/4) Epoch 7, batch 4100, loss[loss=0.2189, simple_loss=0.2867, pruned_loss=0.07558, over 21583.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2949, pruned_loss=0.07179, over 4273738.30 frames. ], batch size: 548, lr: 4.44e-03, grad_scale: 16.0 2023-06-24 14:49:08,899 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.185e+02 2.546e+02 2.998e+02 3.545e+02 8.551e+02, threshold=5.997e+02, percent-clipped=3.0 2023-06-24 14:49:20,070 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1122402.0, ans=0.125 2023-06-24 14:49:32,894 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1122462.0, ans=0.125 2023-06-24 14:49:36,600 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.86 vs. limit=15.0 2023-06-24 14:50:01,304 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1122522.0, ans=0.2 2023-06-24 14:50:41,709 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1122642.0, ans=0.125 2023-06-24 14:50:54,068 INFO [train.py:996] (0/4) Epoch 7, batch 4150, loss[loss=0.1998, simple_loss=0.2888, pruned_loss=0.05539, over 21649.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2948, pruned_loss=0.06957, over 4271530.26 frames. ], batch size: 263, lr: 4.44e-03, grad_scale: 16.0 2023-06-24 14:52:10,064 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.07 vs. 
limit=22.5 2023-06-24 14:52:46,472 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.62 vs. limit=15.0 2023-06-24 14:52:46,859 INFO [train.py:996] (0/4) Epoch 7, batch 4200, loss[loss=0.2474, simple_loss=0.3226, pruned_loss=0.08612, over 21304.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2955, pruned_loss=0.07036, over 4273105.51 frames. ], batch size: 548, lr: 4.43e-03, grad_scale: 16.0 2023-06-24 14:52:57,888 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.766e+02 2.672e+02 2.976e+02 3.504e+02 5.360e+02, threshold=5.952e+02, percent-clipped=0.0 2023-06-24 14:53:55,672 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 14:54:07,854 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1123182.0, ans=0.125 2023-06-24 14:54:08,307 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.27 vs. limit=15.0 2023-06-24 14:54:45,205 INFO [train.py:996] (0/4) Epoch 7, batch 4250, loss[loss=0.2446, simple_loss=0.3614, pruned_loss=0.06391, over 19762.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.3031, pruned_loss=0.07292, over 4270217.40 frames. ], batch size: 702, lr: 4.43e-03, grad_scale: 16.0 2023-06-24 14:54:47,326 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1123302.0, ans=0.04949747468305833 2023-06-24 14:54:58,483 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1123302.0, ans=0.0 2023-06-24 14:55:05,318 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1123302.0, ans=0.125 2023-06-24 14:55:11,259 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.97 vs. limit=10.0 2023-06-24 14:55:29,060 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1123362.0, ans=0.125 2023-06-24 14:55:40,216 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.31 vs. limit=6.0 2023-06-24 14:56:06,767 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1123482.0, ans=0.1 2023-06-24 14:56:43,632 INFO [train.py:996] (0/4) Epoch 7, batch 4300, loss[loss=0.2176, simple_loss=0.272, pruned_loss=0.08159, over 20222.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3078, pruned_loss=0.07471, over 4270339.02 frames. 
], batch size: 702, lr: 4.43e-03, grad_scale: 16.0 2023-06-24 14:56:45,898 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1123602.0, ans=0.125 2023-06-24 14:56:48,740 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.245e+02 3.063e+02 3.693e+02 4.827e+02 7.345e+02, threshold=7.385e+02, percent-clipped=7.0 2023-06-24 14:57:18,152 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1123662.0, ans=0.125 2023-06-24 14:57:24,687 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1123662.0, ans=0.2 2023-06-24 14:58:37,673 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.82 vs. limit=22.5 2023-06-24 14:58:39,586 INFO [train.py:996] (0/4) Epoch 7, batch 4350, loss[loss=0.195, simple_loss=0.2645, pruned_loss=0.06278, over 21898.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3062, pruned_loss=0.07318, over 4268161.68 frames. ], batch size: 107, lr: 4.43e-03, grad_scale: 16.0 2023-06-24 14:59:47,245 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1124082.0, ans=0.125 2023-06-24 14:59:54,517 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1124082.0, ans=0.125 2023-06-24 14:59:58,061 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1124082.0, ans=0.1 2023-06-24 14:59:59,746 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1124142.0, ans=0.125 2023-06-24 15:00:24,118 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1124142.0, ans=0.0 2023-06-24 15:00:35,502 INFO [train.py:996] (0/4) Epoch 7, batch 4400, loss[loss=0.2163, simple_loss=0.3156, pruned_loss=0.05848, over 21782.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.3028, pruned_loss=0.07243, over 4265098.52 frames. ], batch size: 282, lr: 4.43e-03, grad_scale: 32.0 2023-06-24 15:00:41,295 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.078e+02 2.891e+02 3.329e+02 4.006e+02 7.259e+02, threshold=6.659e+02, percent-clipped=0.0 2023-06-24 15:00:46,292 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.69 vs. limit=22.5 2023-06-24 15:01:05,922 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1124262.0, ans=0.125 2023-06-24 15:01:21,244 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1124322.0, ans=0.09899494936611666 2023-06-24 15:01:48,468 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1124382.0, ans=0.1 2023-06-24 15:02:11,145 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.73 vs. limit=10.0 2023-06-24 15:02:28,254 INFO [train.py:996] (0/4) Epoch 7, batch 4450, loss[loss=0.2306, simple_loss=0.3044, pruned_loss=0.07846, over 21784.00 frames. 
], tot_loss[loss=0.2281, simple_loss=0.3081, pruned_loss=0.07409, over 4266147.18 frames. ], batch size: 124, lr: 4.43e-03, grad_scale: 32.0 2023-06-24 15:04:20,243 INFO [train.py:996] (0/4) Epoch 7, batch 4500, loss[loss=0.2203, simple_loss=0.3115, pruned_loss=0.06451, over 21615.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3092, pruned_loss=0.07528, over 4275235.02 frames. ], batch size: 230, lr: 4.43e-03, grad_scale: 32.0 2023-06-24 15:04:25,141 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.118e+02 2.933e+02 3.595e+02 4.328e+02 6.220e+02, threshold=7.189e+02, percent-clipped=0.0 2023-06-24 15:04:31,274 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1124802.0, ans=0.0 2023-06-24 15:04:34,749 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1124802.0, ans=0.125 2023-06-24 15:05:10,281 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1124922.0, ans=0.0 2023-06-24 15:06:07,995 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1125042.0, ans=0.125 2023-06-24 15:06:10,757 INFO [train.py:996] (0/4) Epoch 7, batch 4550, loss[loss=0.2215, simple_loss=0.2742, pruned_loss=0.08443, over 20714.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3116, pruned_loss=0.07566, over 4278262.38 frames. ], batch size: 607, lr: 4.43e-03, grad_scale: 32.0 2023-06-24 15:06:27,808 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1125162.0, ans=0.2 2023-06-24 15:06:45,807 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1125162.0, ans=0.2 2023-06-24 15:07:08,303 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1125222.0, ans=0.125 2023-06-24 15:07:29,476 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1125282.0, ans=0.1 2023-06-24 15:07:31,721 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1125282.0, ans=0.2 2023-06-24 15:07:56,754 INFO [train.py:996] (0/4) Epoch 7, batch 4600, loss[loss=0.2259, simple_loss=0.303, pruned_loss=0.07438, over 21911.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3127, pruned_loss=0.07689, over 4279953.77 frames. ], batch size: 107, lr: 4.43e-03, grad_scale: 32.0 2023-06-24 15:07:58,276 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.81 vs. limit=15.0 2023-06-24 15:08:02,274 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.231e+02 3.095e+02 3.765e+02 5.007e+02 9.113e+02, threshold=7.530e+02, percent-clipped=6.0 2023-06-24 15:09:25,336 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1125582.0, ans=0.125 2023-06-24 15:09:43,144 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=1125642.0, ans=15.0 2023-06-24 15:09:45,831 INFO [train.py:996] (0/4) Epoch 7, batch 4650, loss[loss=0.1999, simple_loss=0.274, pruned_loss=0.06287, over 21883.00 frames. 
], tot_loss[loss=0.2303, simple_loss=0.3087, pruned_loss=0.07598, over 4289267.19 frames. ], batch size: 118, lr: 4.43e-03, grad_scale: 32.0 2023-06-24 15:10:14,908 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1125762.0, ans=0.0 2023-06-24 15:10:33,990 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.32 vs. limit=15.0 2023-06-24 15:11:06,411 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_na.min_abs, batch_count=1125882.0, ans=0.02 2023-06-24 15:11:32,650 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1125942.0, ans=0.0 2023-06-24 15:11:35,517 INFO [train.py:996] (0/4) Epoch 7, batch 4700, loss[loss=0.1891, simple_loss=0.2531, pruned_loss=0.06253, over 21595.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.2994, pruned_loss=0.07346, over 4280376.17 frames. ], batch size: 263, lr: 4.43e-03, grad_scale: 32.0 2023-06-24 15:11:37,670 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1126002.0, ans=10.0 2023-06-24 15:11:45,716 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.908e+02 2.572e+02 2.876e+02 3.232e+02 6.204e+02, threshold=5.752e+02, percent-clipped=0.0 2023-06-24 15:11:59,868 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=1126062.0, ans=10.0 2023-06-24 15:11:59,888 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1126062.0, ans=0.0 2023-06-24 15:12:29,943 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1126122.0, ans=0.125 2023-06-24 15:12:30,717 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.42 vs. limit=22.5 2023-06-24 15:13:17,051 INFO [train.py:996] (0/4) Epoch 7, batch 4750, loss[loss=0.2114, simple_loss=0.2728, pruned_loss=0.07504, over 21765.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2942, pruned_loss=0.0731, over 4269305.23 frames. ], batch size: 247, lr: 4.43e-03, grad_scale: 32.0 2023-06-24 15:13:37,794 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1126302.0, ans=0.1 2023-06-24 15:15:04,111 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1126542.0, ans=0.1 2023-06-24 15:15:11,028 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1126542.0, ans=0.125 2023-06-24 15:15:13,766 INFO [train.py:996] (0/4) Epoch 7, batch 4800, loss[loss=0.2155, simple_loss=0.2916, pruned_loss=0.06974, over 21732.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.2957, pruned_loss=0.07347, over 4271957.05 frames. 
], batch size: 112, lr: 4.43e-03, grad_scale: 32.0 2023-06-24 15:15:18,316 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1126602.0, ans=0.07 2023-06-24 15:15:19,156 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.126e+02 2.780e+02 3.342e+02 3.933e+02 6.055e+02, threshold=6.684e+02, percent-clipped=1.0 2023-06-24 15:15:23,457 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1126602.0, ans=0.0 2023-06-24 15:15:24,910 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1126602.0, ans=0.2 2023-06-24 15:15:36,405 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1126662.0, ans=0.1 2023-06-24 15:16:15,436 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1126722.0, ans=0.125 2023-06-24 15:16:59,128 INFO [train.py:996] (0/4) Epoch 7, batch 4850, loss[loss=0.1983, simple_loss=0.2754, pruned_loss=0.06064, over 21767.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2933, pruned_loss=0.07305, over 4268854.68 frames. ], batch size: 282, lr: 4.43e-03, grad_scale: 32.0 2023-06-24 15:16:59,777 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1126902.0, ans=0.07 2023-06-24 15:17:12,720 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.60 vs. limit=12.0 2023-06-24 15:17:39,004 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1127022.0, ans=0.0 2023-06-24 15:17:40,716 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 15:18:03,644 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1127022.0, ans=0.125 2023-06-24 15:18:22,662 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.72 vs. limit=10.0 2023-06-24 15:18:25,967 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1127142.0, ans=0.125 2023-06-24 15:18:50,504 INFO [train.py:996] (0/4) Epoch 7, batch 4900, loss[loss=0.2482, simple_loss=0.3216, pruned_loss=0.08739, over 21345.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.2979, pruned_loss=0.07411, over 4269715.89 frames. ], batch size: 549, lr: 4.43e-03, grad_scale: 32.0 2023-06-24 15:18:55,340 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.943e+02 2.644e+02 3.017e+02 3.473e+02 6.026e+02, threshold=6.033e+02, percent-clipped=0.0 2023-06-24 15:19:47,636 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1127322.0, ans=0.125 2023-06-24 15:20:01,643 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1127382.0, ans=0.0 2023-06-24 15:20:02,166 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.44 vs. 
limit=15.0 2023-06-24 15:20:41,536 INFO [train.py:996] (0/4) Epoch 7, batch 4950, loss[loss=0.2087, simple_loss=0.3097, pruned_loss=0.05388, over 21713.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2996, pruned_loss=0.07218, over 4265258.33 frames. ], batch size: 298, lr: 4.43e-03, grad_scale: 32.0 2023-06-24 15:21:22,558 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1127562.0, ans=0.125 2023-06-24 15:22:01,259 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1127682.0, ans=0.0 2023-06-24 15:22:20,033 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.57 vs. limit=15.0 2023-06-24 15:22:30,519 INFO [train.py:996] (0/4) Epoch 7, batch 5000, loss[loss=0.2114, simple_loss=0.2714, pruned_loss=0.07567, over 20124.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.3002, pruned_loss=0.06985, over 4274488.92 frames. ], batch size: 702, lr: 4.43e-03, grad_scale: 32.0 2023-06-24 15:22:35,396 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.800e+02 2.510e+02 2.912e+02 3.367e+02 5.959e+02, threshold=5.824e+02, percent-clipped=0.0 2023-06-24 15:23:40,788 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-188000.pt 2023-06-24 15:24:19,961 INFO [train.py:996] (0/4) Epoch 7, batch 5050, loss[loss=0.2341, simple_loss=0.3463, pruned_loss=0.06102, over 20700.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.3003, pruned_loss=0.07131, over 4283443.67 frames. ], batch size: 607, lr: 4.42e-03, grad_scale: 32.0 2023-06-24 15:24:34,665 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1128102.0, ans=0.125 2023-06-24 15:26:10,466 INFO [train.py:996] (0/4) Epoch 7, batch 5100, loss[loss=0.1968, simple_loss=0.2643, pruned_loss=0.06463, over 21454.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.299, pruned_loss=0.07172, over 4283386.25 frames. ], batch size: 194, lr: 4.42e-03, grad_scale: 16.0 2023-06-24 15:26:17,207 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.025e+02 2.691e+02 3.129e+02 3.589e+02 6.328e+02, threshold=6.257e+02, percent-clipped=2.0 2023-06-24 15:26:23,118 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1128402.0, ans=0.125 2023-06-24 15:27:23,183 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1128582.0, ans=0.125 2023-06-24 15:27:26,354 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1128582.0, ans=0.0 2023-06-24 15:28:00,603 INFO [train.py:996] (0/4) Epoch 7, batch 5150, loss[loss=0.2464, simple_loss=0.3087, pruned_loss=0.09204, over 21842.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.2967, pruned_loss=0.07237, over 4289445.26 frames. ], batch size: 351, lr: 4.42e-03, grad_scale: 16.0 2023-06-24 15:28:07,238 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1128702.0, ans=0.125 2023-06-24 15:29:52,276 INFO [train.py:996] (0/4) Epoch 7, batch 5200, loss[loss=0.2278, simple_loss=0.322, pruned_loss=0.06681, over 21727.00 frames. 
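The [checkpoint.py:75] entry above writes zipformer/exp_L_small/checkpoint-188000.pt mid-epoch, i.e. checkpoints are saved every fixed number of training batches in addition to the per-epoch ones. Below is a minimal sketch of batch-count-based checkpointing; a fuller checkpoint would typically also persist optimizer, scheduler, sampler and grad-scaler state.

```python
import torch
from pathlib import Path

def maybe_save_checkpoint(model, optimizer, batch_idx_train: int,
                          exp_dir: Path, save_every_n: int):
    """Write exp_dir/checkpoint-<batch_idx_train>.pt every save_every_n batches."""
    if batch_idx_train == 0 or batch_idx_train % save_every_n != 0:
        return None
    exp_dir.mkdir(parents=True, exist_ok=True)
    filename = exp_dir / f"checkpoint-{batch_idx_train}.pt"
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "batch_idx_train": batch_idx_train,
        },
        filename,
    )
    return filename
```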
], tot_loss[loss=0.2225, simple_loss=0.3, pruned_loss=0.07251, over 4289360.59 frames. ], batch size: 247, lr: 4.42e-03, grad_scale: 32.0 2023-06-24 15:29:59,475 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.122e+02 2.769e+02 3.246e+02 4.133e+02 8.749e+02, threshold=6.492e+02, percent-clipped=7.0 2023-06-24 15:31:04,092 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.06 vs. limit=15.0 2023-06-24 15:31:41,118 INFO [train.py:996] (0/4) Epoch 7, batch 5250, loss[loss=0.1618, simple_loss=0.2329, pruned_loss=0.04533, over 21895.00 frames. ], tot_loss[loss=0.222, simple_loss=0.3026, pruned_loss=0.0707, over 4286157.13 frames. ], batch size: 98, lr: 4.42e-03, grad_scale: 32.0 2023-06-24 15:31:41,806 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1129302.0, ans=0.1 2023-06-24 15:31:47,052 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1129302.0, ans=0.0 2023-06-24 15:32:33,554 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1129422.0, ans=0.1 2023-06-24 15:33:04,410 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1129482.0, ans=0.125 2023-06-24 15:33:11,369 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1129482.0, ans=0.1 2023-06-24 15:33:27,326 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1129542.0, ans=0.1 2023-06-24 15:33:31,871 INFO [train.py:996] (0/4) Epoch 7, batch 5300, loss[loss=0.2349, simple_loss=0.3049, pruned_loss=0.08243, over 21881.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.302, pruned_loss=0.07125, over 4287874.28 frames. ], batch size: 371, lr: 4.42e-03, grad_scale: 32.0 2023-06-24 15:33:38,417 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.003e+02 2.522e+02 2.825e+02 3.420e+02 5.349e+02, threshold=5.650e+02, percent-clipped=0.0 2023-06-24 15:34:18,124 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1129662.0, ans=0.125 2023-06-24 15:35:17,657 INFO [train.py:996] (0/4) Epoch 7, batch 5350, loss[loss=0.2307, simple_loss=0.3008, pruned_loss=0.08031, over 21743.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.3009, pruned_loss=0.0725, over 4293392.18 frames. 
], batch size: 389, lr: 4.42e-03, grad_scale: 16.0 2023-06-24 15:35:25,236 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1129902.0, ans=0.125 2023-06-24 15:36:40,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1130082.0, ans=0.125 2023-06-24 15:36:56,916 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1130142.0, ans=0.0 2023-06-24 15:37:02,441 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1130142.0, ans=0.125 2023-06-24 15:37:07,125 INFO [train.py:996] (0/4) Epoch 7, batch 5400, loss[loss=0.2243, simple_loss=0.2913, pruned_loss=0.07862, over 21354.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.2986, pruned_loss=0.07308, over 4296016.53 frames. ], batch size: 144, lr: 4.42e-03, grad_scale: 16.0 2023-06-24 15:37:16,432 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.105e+02 2.748e+02 3.020e+02 3.535e+02 6.573e+02, threshold=6.041e+02, percent-clipped=2.0 2023-06-24 15:37:33,937 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.74 vs. limit=6.0 2023-06-24 15:37:35,406 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1130262.0, ans=0.0 2023-06-24 15:37:42,563 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.70 vs. limit=15.0 2023-06-24 15:38:17,852 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1130382.0, ans=0.125 2023-06-24 15:38:59,046 INFO [train.py:996] (0/4) Epoch 7, batch 5450, loss[loss=0.2025, simple_loss=0.2896, pruned_loss=0.0577, over 21429.00 frames. ], tot_loss[loss=0.221, simple_loss=0.2989, pruned_loss=0.07156, over 4297498.25 frames. ], batch size: 194, lr: 4.42e-03, grad_scale: 16.0 2023-06-24 15:39:08,770 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.04 vs. limit=15.0 2023-06-24 15:39:51,282 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1130622.0, ans=0.125 2023-06-24 15:39:51,829 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.58 vs. limit=15.0 2023-06-24 15:40:21,593 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1130742.0, ans=0.0 2023-06-24 15:40:35,919 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1130742.0, ans=0.2 2023-06-24 15:40:50,142 INFO [train.py:996] (0/4) Epoch 7, batch 5500, loss[loss=0.2442, simple_loss=0.3463, pruned_loss=0.07109, over 21210.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.3036, pruned_loss=0.06935, over 4289158.84 frames. 
], batch size: 548, lr: 4.42e-03, grad_scale: 16.0 2023-06-24 15:40:52,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1130802.0, ans=0.1 2023-06-24 15:40:58,244 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.932e+02 2.852e+02 3.783e+02 5.353e+02 8.274e+02, threshold=7.565e+02, percent-clipped=13.0 2023-06-24 15:42:26,706 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1131042.0, ans=0.0 2023-06-24 15:42:40,373 INFO [train.py:996] (0/4) Epoch 7, batch 5550, loss[loss=0.1788, simple_loss=0.2763, pruned_loss=0.04069, over 21684.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.3041, pruned_loss=0.06769, over 4287745.64 frames. ], batch size: 298, lr: 4.42e-03, grad_scale: 16.0 2023-06-24 15:43:26,775 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1131222.0, ans=0.0 2023-06-24 15:44:20,463 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1131342.0, ans=0.125 2023-06-24 15:44:30,787 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1131402.0, ans=0.2 2023-06-24 15:44:31,790 INFO [train.py:996] (0/4) Epoch 7, batch 5600, loss[loss=0.312, simple_loss=0.405, pruned_loss=0.1095, over 21525.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.3005, pruned_loss=0.0649, over 4292103.06 frames. ], batch size: 471, lr: 4.42e-03, grad_scale: 32.0 2023-06-24 15:44:45,775 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.709e+02 2.480e+02 2.959e+02 3.871e+02 8.894e+02, threshold=5.918e+02, percent-clipped=1.0 2023-06-24 15:45:01,851 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1131462.0, ans=0.125 2023-06-24 15:45:31,643 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1131522.0, ans=0.125 2023-06-24 15:45:40,251 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1131582.0, ans=0.125 2023-06-24 15:45:42,021 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1131582.0, ans=0.0 2023-06-24 15:45:42,646 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.59 vs. limit=22.5 2023-06-24 15:46:15,112 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1131642.0, ans=0.125 2023-06-24 15:46:17,200 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.27 vs. limit=15.0 2023-06-24 15:46:19,841 INFO [train.py:996] (0/4) Epoch 7, batch 5650, loss[loss=0.2451, simple_loss=0.317, pruned_loss=0.08657, over 21728.00 frames. ], tot_loss[loss=0.219, simple_loss=0.3038, pruned_loss=0.06713, over 4285370.76 frames. 
], batch size: 389, lr: 4.42e-03, grad_scale: 32.0 2023-06-24 15:48:12,107 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1131942.0, ans=0.0 2023-06-24 15:48:15,337 INFO [train.py:996] (0/4) Epoch 7, batch 5700, loss[loss=0.2152, simple_loss=0.2891, pruned_loss=0.07069, over 21336.00 frames. ], tot_loss[loss=0.221, simple_loss=0.3031, pruned_loss=0.06949, over 4292076.47 frames. ], batch size: 159, lr: 4.42e-03, grad_scale: 16.0 2023-06-24 15:48:25,352 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1132002.0, ans=0.1 2023-06-24 15:48:26,251 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.045e+02 2.629e+02 3.066e+02 3.731e+02 7.827e+02, threshold=6.133e+02, percent-clipped=4.0 2023-06-24 15:49:58,480 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1132242.0, ans=10.0 2023-06-24 15:50:01,746 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1132242.0, ans=0.125 2023-06-24 15:50:06,421 INFO [train.py:996] (0/4) Epoch 7, batch 5750, loss[loss=0.1935, simple_loss=0.2862, pruned_loss=0.05036, over 21776.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2981, pruned_loss=0.06672, over 4285776.28 frames. ], batch size: 333, lr: 4.42e-03, grad_scale: 16.0 2023-06-24 15:50:57,502 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1132422.0, ans=0.04949747468305833 2023-06-24 15:51:53,876 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=1132542.0, ans=6.0 2023-06-24 15:51:56,246 INFO [train.py:996] (0/4) Epoch 7, batch 5800, loss[loss=0.21, simple_loss=0.3056, pruned_loss=0.05722, over 21754.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2978, pruned_loss=0.06544, over 4279541.89 frames. ], batch size: 282, lr: 4.42e-03, grad_scale: 16.0 2023-06-24 15:52:12,050 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.870e+02 2.681e+02 3.323e+02 4.302e+02 6.884e+02, threshold=6.646e+02, percent-clipped=1.0 2023-06-24 15:52:45,226 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1132662.0, ans=0.0 2023-06-24 15:52:48,920 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1132722.0, ans=0.125 2023-06-24 15:53:35,688 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1132842.0, ans=0.125 2023-06-24 15:53:58,498 INFO [train.py:996] (0/4) Epoch 7, batch 5850, loss[loss=0.1928, simple_loss=0.3018, pruned_loss=0.04187, over 21643.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2949, pruned_loss=0.06186, over 4272665.14 frames. 
], batch size: 414, lr: 4.42e-03, grad_scale: 16.0 2023-06-24 15:54:14,617 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1132902.0, ans=0.035 2023-06-24 15:54:42,648 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=1133022.0, ans=10.0 2023-06-24 15:54:58,513 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1133022.0, ans=0.0 2023-06-24 15:55:06,006 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.42 vs. limit=15.0 2023-06-24 15:55:18,642 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.40 vs. limit=15.0 2023-06-24 15:55:29,540 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1133142.0, ans=0.1 2023-06-24 15:55:34,596 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1133142.0, ans=0.05 2023-06-24 15:55:51,564 INFO [train.py:996] (0/4) Epoch 7, batch 5900, loss[loss=0.1776, simple_loss=0.2637, pruned_loss=0.04574, over 21643.00 frames. ], tot_loss[loss=0.2017, simple_loss=0.2884, pruned_loss=0.05747, over 4271289.70 frames. ], batch size: 263, lr: 4.41e-03, grad_scale: 16.0 2023-06-24 15:55:51,979 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1133202.0, ans=0.035 2023-06-24 15:56:01,771 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.413e+02 2.024e+02 2.372e+02 2.933e+02 6.586e+02, threshold=4.744e+02, percent-clipped=0.0 2023-06-24 15:56:16,612 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1133262.0, ans=0.5 2023-06-24 15:57:17,934 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1133442.0, ans=0.1 2023-06-24 15:57:39,664 INFO [train.py:996] (0/4) Epoch 7, batch 5950, loss[loss=0.2337, simple_loss=0.345, pruned_loss=0.06118, over 21193.00 frames. ], tot_loss[loss=0.2047, simple_loss=0.2881, pruned_loss=0.06067, over 4268237.97 frames. ], batch size: 548, lr: 4.41e-03, grad_scale: 16.0 2023-06-24 15:58:20,412 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1133622.0, ans=0.0 2023-06-24 15:58:42,678 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1133682.0, ans=0.0 2023-06-24 15:59:01,521 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1133742.0, ans=0.0 2023-06-24 15:59:10,938 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1133742.0, ans=0.2 2023-06-24 15:59:17,548 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1133742.0, ans=0.1 2023-06-24 15:59:27,089 INFO [train.py:996] (0/4) Epoch 7, batch 6000, loss[loss=0.1991, simple_loss=0.2644, pruned_loss=0.06686, over 21807.00 frames. 
], tot_loss[loss=0.2058, simple_loss=0.2849, pruned_loss=0.06339, over 4278249.57 frames. ], batch size: 107, lr: 4.41e-03, grad_scale: 32.0 2023-06-24 15:59:27,090 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-24 15:59:44,452 INFO [train.py:1028] (0/4) Epoch 7, validation: loss=0.2613, simple_loss=0.3539, pruned_loss=0.08436, over 1796401.00 frames. 2023-06-24 15:59:44,454 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23616MB 2023-06-24 15:59:45,902 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.92 vs. limit=15.0 2023-06-24 15:59:57,240 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.008e+02 3.144e+02 3.731e+02 4.665e+02 6.977e+02, threshold=7.462e+02, percent-clipped=24.0 2023-06-24 16:01:04,460 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.88 vs. limit=6.0 2023-06-24 16:01:16,402 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1133982.0, ans=0.125 2023-06-24 16:01:26,965 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1134042.0, ans=0.2 2023-06-24 16:01:35,795 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 16:01:36,709 INFO [train.py:996] (0/4) Epoch 7, batch 6050, loss[loss=0.2054, simple_loss=0.2928, pruned_loss=0.05899, over 21383.00 frames. ], tot_loss[loss=0.2041, simple_loss=0.2814, pruned_loss=0.06344, over 4281922.62 frames. ], batch size: 471, lr: 4.41e-03, grad_scale: 16.0 2023-06-24 16:02:08,836 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1134162.0, ans=0.0 2023-06-24 16:03:19,570 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1134342.0, ans=0.035 2023-06-24 16:03:24,862 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1134342.0, ans=0.125 2023-06-24 16:03:27,728 INFO [train.py:996] (0/4) Epoch 7, batch 6100, loss[loss=0.252, simple_loss=0.3116, pruned_loss=0.09619, over 21349.00 frames. ], tot_loss[loss=0.2017, simple_loss=0.279, pruned_loss=0.06218, over 4284738.95 frames. ], batch size: 159, lr: 4.41e-03, grad_scale: 16.0 2023-06-24 16:03:39,884 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.866e+02 2.425e+02 2.947e+02 3.693e+02 6.413e+02, threshold=5.895e+02, percent-clipped=0.0 2023-06-24 16:03:49,613 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1134462.0, ans=0.07 2023-06-24 16:04:07,223 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1134522.0, ans=0.125 2023-06-24 16:04:39,769 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.08 vs. limit=22.5 2023-06-24 16:05:17,165 INFO [train.py:996] (0/4) Epoch 7, batch 6150, loss[loss=0.2046, simple_loss=0.2783, pruned_loss=0.06548, over 21608.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.2824, pruned_loss=0.06472, over 4287334.83 frames. 
], batch size: 230, lr: 4.41e-03, grad_scale: 8.0 2023-06-24 16:05:41,030 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.71 vs. limit=12.0 2023-06-24 16:06:52,076 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1134942.0, ans=0.125 2023-06-24 16:06:56,522 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.23 vs. limit=15.0 2023-06-24 16:07:01,070 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1134942.0, ans=0.125 2023-06-24 16:07:04,741 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1135002.0, ans=0.1 2023-06-24 16:07:05,673 INFO [train.py:996] (0/4) Epoch 7, batch 6200, loss[loss=0.2527, simple_loss=0.3309, pruned_loss=0.0872, over 21860.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.287, pruned_loss=0.0661, over 4283643.50 frames. ], batch size: 371, lr: 4.41e-03, grad_scale: 8.0 2023-06-24 16:07:25,605 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.918e+02 2.569e+02 3.119e+02 3.567e+02 5.212e+02, threshold=6.237e+02, percent-clipped=0.0 2023-06-24 16:07:35,351 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1135062.0, ans=0.125 2023-06-24 16:08:13,598 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1135122.0, ans=0.125 2023-06-24 16:08:56,830 INFO [train.py:996] (0/4) Epoch 7, batch 6250, loss[loss=0.2043, simple_loss=0.3044, pruned_loss=0.05209, over 21835.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.293, pruned_loss=0.06639, over 4277978.22 frames. ], batch size: 316, lr: 4.41e-03, grad_scale: 8.0 2023-06-24 16:09:07,880 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1135302.0, ans=0.125 2023-06-24 16:09:15,583 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.59 vs. limit=12.0 2023-06-24 16:09:29,414 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1135362.0, ans=10.0 2023-06-24 16:10:11,452 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1135482.0, ans=0.125 2023-06-24 16:10:15,394 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1135482.0, ans=0.125 2023-06-24 16:10:38,577 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1135542.0, ans=0.1 2023-06-24 16:10:51,876 INFO [train.py:996] (0/4) Epoch 7, batch 6300, loss[loss=0.2022, simple_loss=0.2764, pruned_loss=0.06402, over 21645.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2967, pruned_loss=0.06553, over 4286191.80 frames. 
], batch size: 230, lr: 4.41e-03, grad_scale: 8.0 2023-06-24 16:11:01,830 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1135602.0, ans=0.0 2023-06-24 16:11:06,077 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.925e+02 2.617e+02 3.122e+02 4.088e+02 6.551e+02, threshold=6.244e+02, percent-clipped=1.0 2023-06-24 16:11:35,970 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1135722.0, ans=0.0 2023-06-24 16:11:37,643 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1135722.0, ans=0.2 2023-06-24 16:11:41,416 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1135722.0, ans=0.0 2023-06-24 16:11:48,451 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1135722.0, ans=0.1 2023-06-24 16:12:07,633 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1135782.0, ans=0.1 2023-06-24 16:12:40,748 INFO [train.py:996] (0/4) Epoch 7, batch 6350, loss[loss=0.2835, simple_loss=0.3431, pruned_loss=0.112, over 21801.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2992, pruned_loss=0.0688, over 4286284.47 frames. ], batch size: 441, lr: 4.41e-03, grad_scale: 8.0 2023-06-24 16:13:18,054 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1135962.0, ans=0.07 2023-06-24 16:13:34,120 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.69 vs. limit=15.0 2023-06-24 16:14:27,721 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1136142.0, ans=0.1 2023-06-24 16:14:30,433 INFO [train.py:996] (0/4) Epoch 7, batch 6400, loss[loss=0.237, simple_loss=0.3136, pruned_loss=0.0802, over 21705.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.304, pruned_loss=0.0725, over 4282048.79 frames. ], batch size: 351, lr: 4.41e-03, grad_scale: 16.0 2023-06-24 16:14:55,415 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.098e+02 2.966e+02 3.361e+02 3.840e+02 6.220e+02, threshold=6.721e+02, percent-clipped=0.0 2023-06-24 16:16:01,612 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1136442.0, ans=0.125 2023-06-24 16:16:25,895 INFO [train.py:996] (0/4) Epoch 7, batch 6450, loss[loss=0.2784, simple_loss=0.3382, pruned_loss=0.1093, over 21391.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3081, pruned_loss=0.07294, over 4281410.16 frames. 
], batch size: 507, lr: 4.41e-03, grad_scale: 16.0 2023-06-24 16:16:50,804 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1136502.0, ans=0.2 2023-06-24 16:17:29,558 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1136622.0, ans=0.0 2023-06-24 16:17:41,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1136682.0, ans=0.1 2023-06-24 16:18:14,964 INFO [train.py:996] (0/4) Epoch 7, batch 6500, loss[loss=0.2323, simple_loss=0.3098, pruned_loss=0.07735, over 21577.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.3022, pruned_loss=0.07159, over 4277008.83 frames. ], batch size: 441, lr: 4.41e-03, grad_scale: 16.0 2023-06-24 16:18:28,850 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1136802.0, ans=0.0 2023-06-24 16:18:28,951 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1136802.0, ans=0.0 2023-06-24 16:18:38,260 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.984e+02 2.852e+02 3.600e+02 4.849e+02 8.797e+02, threshold=7.199e+02, percent-clipped=3.0 2023-06-24 16:18:38,850 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1136802.0, ans=0.0 2023-06-24 16:19:07,909 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.47 vs. limit=22.5 2023-06-24 16:19:10,346 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1136922.0, ans=0.125 2023-06-24 16:19:47,724 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1137042.0, ans=0.125 2023-06-24 16:19:58,962 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1137042.0, ans=0.1 2023-06-24 16:20:03,508 INFO [train.py:996] (0/4) Epoch 7, batch 6550, loss[loss=0.2338, simple_loss=0.3052, pruned_loss=0.08125, over 21857.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.3004, pruned_loss=0.0711, over 4270122.40 frames. ], batch size: 371, lr: 4.41e-03, grad_scale: 16.0 2023-06-24 16:20:40,477 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1137162.0, ans=0.05 2023-06-24 16:20:43,618 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1137162.0, ans=0.125 2023-06-24 16:21:05,558 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.15 vs. limit=15.0 2023-06-24 16:21:24,985 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1137282.0, ans=0.125 2023-06-24 16:21:41,986 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1137342.0, ans=0.125 2023-06-24 16:21:53,188 INFO [train.py:996] (0/4) Epoch 7, batch 6600, loss[loss=0.1831, simple_loss=0.2483, pruned_loss=0.05895, over 21255.00 frames. 
], tot_loss[loss=0.2173, simple_loss=0.2944, pruned_loss=0.07015, over 4273789.27 frames. ], batch size: 176, lr: 4.41e-03, grad_scale: 16.0 2023-06-24 16:22:17,170 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.905e+02 2.518e+02 2.917e+02 3.263e+02 5.305e+02, threshold=5.833e+02, percent-clipped=0.0 2023-06-24 16:22:21,115 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.31 vs. limit=15.0 2023-06-24 16:23:07,743 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1137582.0, ans=0.125 2023-06-24 16:23:53,019 INFO [train.py:996] (0/4) Epoch 7, batch 6650, loss[loss=0.1949, simple_loss=0.2673, pruned_loss=0.0612, over 21743.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2865, pruned_loss=0.06815, over 4275318.58 frames. ], batch size: 352, lr: 4.41e-03, grad_scale: 16.0 2023-06-24 16:25:12,260 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1137882.0, ans=0.0 2023-06-24 16:25:41,013 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1137942.0, ans=0.125 2023-06-24 16:25:43,958 INFO [train.py:996] (0/4) Epoch 7, batch 6700, loss[loss=0.1864, simple_loss=0.2579, pruned_loss=0.05747, over 21742.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.2812, pruned_loss=0.06804, over 4268398.78 frames. ], batch size: 124, lr: 4.41e-03, grad_scale: 16.0 2023-06-24 16:25:45,889 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1138002.0, ans=0.2 2023-06-24 16:25:57,327 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.898e+02 2.457e+02 2.786e+02 3.230e+02 4.297e+02, threshold=5.572e+02, percent-clipped=0.0 2023-06-24 16:26:26,305 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1138122.0, ans=0.125 2023-06-24 16:26:30,262 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.70 vs. limit=22.5 2023-06-24 16:27:26,667 INFO [train.py:996] (0/4) Epoch 7, batch 6750, loss[loss=0.2271, simple_loss=0.2985, pruned_loss=0.07785, over 21834.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2791, pruned_loss=0.06825, over 4261436.61 frames. ], batch size: 371, lr: 4.40e-03, grad_scale: 16.0 2023-06-24 16:27:58,273 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1138362.0, ans=0.125 2023-06-24 16:28:19,612 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1138422.0, ans=0.125 2023-06-24 16:28:49,725 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1138542.0, ans=0.0 2023-06-24 16:29:04,967 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1138542.0, ans=0.0 2023-06-24 16:29:09,315 INFO [train.py:996] (0/4) Epoch 7, batch 6800, loss[loss=0.1978, simple_loss=0.2714, pruned_loss=0.06211, over 21803.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2813, pruned_loss=0.07029, over 4270331.95 frames. 
], batch size: 247, lr: 4.40e-03, grad_scale: 32.0 2023-06-24 16:29:23,227 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.205e+02 2.710e+02 3.194e+02 3.747e+02 5.784e+02, threshold=6.389e+02, percent-clipped=2.0 2023-06-24 16:30:02,108 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1138722.0, ans=0.09899494936611666 2023-06-24 16:30:20,751 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1138782.0, ans=0.125 2023-06-24 16:30:51,562 INFO [train.py:996] (0/4) Epoch 7, batch 6850, loss[loss=0.2106, simple_loss=0.2855, pruned_loss=0.06786, over 21858.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.2813, pruned_loss=0.07115, over 4272691.34 frames. ], batch size: 316, lr: 4.40e-03, grad_scale: 16.0 2023-06-24 16:30:58,923 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1138902.0, ans=0.0 2023-06-24 16:32:01,136 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1139082.0, ans=0.05 2023-06-24 16:32:17,967 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.30 vs. limit=15.0 2023-06-24 16:32:41,615 INFO [train.py:996] (0/4) Epoch 7, batch 6900, loss[loss=0.1712, simple_loss=0.2599, pruned_loss=0.04129, over 21311.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2831, pruned_loss=0.07111, over 4270184.75 frames. ], batch size: 176, lr: 4.40e-03, grad_scale: 16.0 2023-06-24 16:33:03,120 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.166e+02 2.809e+02 3.309e+02 4.065e+02 7.013e+02, threshold=6.619e+02, percent-clipped=1.0 2023-06-24 16:33:05,469 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1139262.0, ans=0.125 2023-06-24 16:34:05,677 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1139382.0, ans=0.125 2023-06-24 16:34:37,859 INFO [train.py:996] (0/4) Epoch 7, batch 6950, loss[loss=0.2429, simple_loss=0.3256, pruned_loss=0.08008, over 21577.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2853, pruned_loss=0.06817, over 4270147.52 frames. ], batch size: 414, lr: 4.40e-03, grad_scale: 16.0 2023-06-24 16:35:00,017 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 16:35:17,868 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1139622.0, ans=0.125 2023-06-24 16:36:03,724 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1139742.0, ans=0.125 2023-06-24 16:36:27,744 INFO [train.py:996] (0/4) Epoch 7, batch 7000, loss[loss=0.1845, simple_loss=0.2535, pruned_loss=0.05776, over 21686.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2891, pruned_loss=0.07068, over 4260739.71 frames. 
], batch size: 282, lr: 4.40e-03, grad_scale: 16.0 2023-06-24 16:36:49,409 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.743e+02 2.855e+02 3.392e+02 4.148e+02 6.941e+02, threshold=6.785e+02, percent-clipped=1.0 2023-06-24 16:36:58,942 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1139862.0, ans=0.125 2023-06-24 16:37:20,586 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1139922.0, ans=0.1 2023-06-24 16:37:20,793 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1139922.0, ans=0.05 2023-06-24 16:37:51,251 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1139982.0, ans=0.125 2023-06-24 16:38:07,032 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1140042.0, ans=0.025 2023-06-24 16:38:18,594 INFO [train.py:996] (0/4) Epoch 7, batch 7050, loss[loss=0.2065, simple_loss=0.2652, pruned_loss=0.07386, over 21559.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.287, pruned_loss=0.0698, over 4249337.02 frames. ], batch size: 441, lr: 4.40e-03, grad_scale: 16.0 2023-06-24 16:38:25,827 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1140102.0, ans=0.125 2023-06-24 16:38:40,149 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1140162.0, ans=0.125 2023-06-24 16:39:38,121 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1140282.0, ans=0.125 2023-06-24 16:40:00,412 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.80 vs. limit=10.0 2023-06-24 16:40:15,912 INFO [train.py:996] (0/4) Epoch 7, batch 7100, loss[loss=0.2074, simple_loss=0.2824, pruned_loss=0.06615, over 21355.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2915, pruned_loss=0.07127, over 4255162.23 frames. ], batch size: 131, lr: 4.40e-03, grad_scale: 16.0 2023-06-24 16:40:31,716 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.970e+02 2.792e+02 3.207e+02 3.771e+02 5.994e+02, threshold=6.414e+02, percent-clipped=0.0 2023-06-24 16:40:32,435 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1140462.0, ans=0.125 2023-06-24 16:40:46,741 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1140462.0, ans=0.125 2023-06-24 16:41:22,001 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.12 vs. limit=10.0 2023-06-24 16:41:42,280 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1140642.0, ans=0.1 2023-06-24 16:42:06,521 INFO [train.py:996] (0/4) Epoch 7, batch 7150, loss[loss=0.2302, simple_loss=0.3041, pruned_loss=0.07817, over 21609.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.288, pruned_loss=0.06874, over 4250857.96 frames. 
], batch size: 263, lr: 4.40e-03, grad_scale: 16.0 2023-06-24 16:42:13,275 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.14 vs. limit=12.0 2023-06-24 16:42:28,610 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1140762.0, ans=0.07 2023-06-24 16:43:06,240 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1140822.0, ans=0.1 2023-06-24 16:43:41,827 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.70 vs. limit=15.0 2023-06-24 16:43:56,406 INFO [train.py:996] (0/4) Epoch 7, batch 7200, loss[loss=0.2661, simple_loss=0.3088, pruned_loss=0.1117, over 21241.00 frames. ], tot_loss[loss=0.217, simple_loss=0.291, pruned_loss=0.0715, over 4251416.82 frames. ], batch size: 471, lr: 4.40e-03, grad_scale: 32.0 2023-06-24 16:44:12,346 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.853e+02 2.840e+02 3.235e+02 4.044e+02 5.731e+02, threshold=6.469e+02, percent-clipped=0.0 2023-06-24 16:44:20,224 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1141062.0, ans=0.125 2023-06-24 16:44:22,013 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1141062.0, ans=0.0 2023-06-24 16:44:44,601 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1141122.0, ans=0.05 2023-06-24 16:45:21,356 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1141182.0, ans=0.0 2023-06-24 16:45:28,076 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1141242.0, ans=0.1 2023-06-24 16:45:32,042 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1141242.0, ans=0.125 2023-06-24 16:45:35,443 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1141242.0, ans=0.1 2023-06-24 16:45:45,335 INFO [train.py:996] (0/4) Epoch 7, batch 7250, loss[loss=0.2061, simple_loss=0.2717, pruned_loss=0.07029, over 21182.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2878, pruned_loss=0.07148, over 4250834.90 frames. ], batch size: 176, lr: 4.40e-03, grad_scale: 16.0 2023-06-24 16:45:50,928 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1141302.0, ans=0.1 2023-06-24 16:46:36,960 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1141422.0, ans=0.2 2023-06-24 16:46:42,671 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=1141422.0, ans=15.0 2023-06-24 16:46:42,816 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.04 vs. limit=22.5 2023-06-24 16:47:12,675 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.06 vs. 
limit=15.0 2023-06-24 16:47:34,243 INFO [train.py:996] (0/4) Epoch 7, batch 7300, loss[loss=0.2004, simple_loss=0.2631, pruned_loss=0.06882, over 21814.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.282, pruned_loss=0.07049, over 4258208.81 frames. ], batch size: 352, lr: 4.40e-03, grad_scale: 16.0 2023-06-24 16:47:50,976 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.86 vs. limit=15.0 2023-06-24 16:47:51,209 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.050e+02 2.579e+02 3.088e+02 3.610e+02 6.583e+02, threshold=6.177e+02, percent-clipped=0.0 2023-06-24 16:48:55,021 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1141782.0, ans=0.0 2023-06-24 16:49:20,276 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1141842.0, ans=0.5 2023-06-24 16:49:25,154 INFO [train.py:996] (0/4) Epoch 7, batch 7350, loss[loss=0.2666, simple_loss=0.3366, pruned_loss=0.09827, over 21811.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2814, pruned_loss=0.07082, over 4263549.63 frames. ], batch size: 118, lr: 4.40e-03, grad_scale: 16.0 2023-06-24 16:49:29,188 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1141902.0, ans=0.1 2023-06-24 16:50:17,651 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.56 vs. limit=15.0 2023-06-24 16:50:48,342 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.43 vs. limit=6.0 2023-06-24 16:50:56,213 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1142142.0, ans=0.125 2023-06-24 16:51:11,764 INFO [train.py:996] (0/4) Epoch 7, batch 7400, loss[loss=0.2124, simple_loss=0.3065, pruned_loss=0.05908, over 21848.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2866, pruned_loss=0.07184, over 4263833.31 frames. 
], batch size: 317, lr: 4.40e-03, grad_scale: 16.0 2023-06-24 16:51:14,662 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1142202.0, ans=0.09899494936611666 2023-06-24 16:51:21,519 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1142202.0, ans=0.2 2023-06-24 16:51:41,600 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.150e+02 2.851e+02 3.315e+02 4.181e+02 6.542e+02, threshold=6.630e+02, percent-clipped=3.0 2023-06-24 16:52:03,137 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1142322.0, ans=0.125 2023-06-24 16:52:19,742 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1142322.0, ans=0.125 2023-06-24 16:52:21,335 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 16:52:22,789 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1142322.0, ans=0.125 2023-06-24 16:52:26,339 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1142382.0, ans=0.0 2023-06-24 16:52:44,309 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1142442.0, ans=0.125 2023-06-24 16:52:55,087 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1142442.0, ans=0.1 2023-06-24 16:53:03,478 INFO [train.py:996] (0/4) Epoch 7, batch 7450, loss[loss=0.2151, simple_loss=0.2795, pruned_loss=0.07533, over 21880.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2849, pruned_loss=0.07148, over 4260563.84 frames. ], batch size: 373, lr: 4.40e-03, grad_scale: 16.0 2023-06-24 16:53:15,599 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.67 vs. limit=15.0 2023-06-24 16:53:52,979 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1142562.0, ans=0.0 2023-06-24 16:54:47,447 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1142742.0, ans=0.025 2023-06-24 16:55:06,453 INFO [train.py:996] (0/4) Epoch 7, batch 7500, loss[loss=0.2737, simple_loss=0.3756, pruned_loss=0.08589, over 21671.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.29, pruned_loss=0.07389, over 4261611.59 frames. ], batch size: 414, lr: 4.40e-03, grad_scale: 16.0 2023-06-24 16:55:29,607 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.290e+02 3.030e+02 3.534e+02 4.560e+02 9.672e+02, threshold=7.067e+02, percent-clipped=6.0 2023-06-24 16:56:26,247 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.82 vs. limit=12.0 2023-06-24 16:56:47,778 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1143042.0, ans=0.2 2023-06-24 16:56:56,951 INFO [train.py:996] (0/4) Epoch 7, batch 7550, loss[loss=0.2017, simple_loss=0.297, pruned_loss=0.05317, over 21725.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2969, pruned_loss=0.07201, over 4259097.84 frames. 
], batch size: 247, lr: 4.40e-03, grad_scale: 16.0 2023-06-24 16:57:30,524 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 16:58:41,054 INFO [train.py:996] (0/4) Epoch 7, batch 7600, loss[loss=0.1846, simple_loss=0.2514, pruned_loss=0.05893, over 16783.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2976, pruned_loss=0.07169, over 4257632.22 frames. ], batch size: 60, lr: 4.39e-03, grad_scale: 32.0 2023-06-24 16:59:09,489 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.047e+02 2.834e+02 3.229e+02 4.103e+02 6.859e+02, threshold=6.458e+02, percent-clipped=0.0 2023-06-24 16:59:25,374 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1143522.0, ans=0.125 2023-06-24 16:59:32,721 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1143522.0, ans=0.125 2023-06-24 17:00:24,031 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1143642.0, ans=0.125 2023-06-24 17:00:26,833 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.42 vs. limit=22.5 2023-06-24 17:00:36,345 INFO [train.py:996] (0/4) Epoch 7, batch 7650, loss[loss=0.2193, simple_loss=0.3214, pruned_loss=0.05857, over 19885.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.2962, pruned_loss=0.07317, over 4270778.64 frames. ], batch size: 703, lr: 4.39e-03, grad_scale: 32.0 2023-06-24 17:01:20,540 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1143762.0, ans=0.125 2023-06-24 17:01:36,451 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1143822.0, ans=0.125 2023-06-24 17:01:39,805 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1143882.0, ans=0.0 2023-06-24 17:02:28,492 INFO [train.py:996] (0/4) Epoch 7, batch 7700, loss[loss=0.2729, simple_loss=0.343, pruned_loss=0.1014, over 21789.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3004, pruned_loss=0.07608, over 4278030.05 frames. ], batch size: 441, lr: 4.39e-03, grad_scale: 16.0 2023-06-24 17:02:53,678 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.223e+02 2.786e+02 3.159e+02 3.961e+02 6.423e+02, threshold=6.319e+02, percent-clipped=0.0 2023-06-24 17:04:29,057 INFO [train.py:996] (0/4) Epoch 7, batch 7750, loss[loss=0.1508, simple_loss=0.2059, pruned_loss=0.04787, over 17188.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.305, pruned_loss=0.07544, over 4266137.24 frames. ], batch size: 62, lr: 4.39e-03, grad_scale: 16.0 2023-06-24 17:04:53,407 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.35 vs. limit=6.0 2023-06-24 17:05:14,638 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1144422.0, ans=0.125 2023-06-24 17:06:09,425 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.20 vs. 
limit=10.0 2023-06-24 17:06:27,899 INFO [train.py:996] (0/4) Epoch 7, batch 7800, loss[loss=0.2367, simple_loss=0.3144, pruned_loss=0.07952, over 21619.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3055, pruned_loss=0.07587, over 4258642.26 frames. ], batch size: 389, lr: 4.39e-03, grad_scale: 16.0 2023-06-24 17:06:47,321 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.299e+02 3.311e+02 4.032e+02 5.871e+02 9.097e+02, threshold=8.064e+02, percent-clipped=12.0 2023-06-24 17:07:25,659 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1144782.0, ans=0.1 2023-06-24 17:07:43,678 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1144782.0, ans=0.125 2023-06-24 17:08:11,746 INFO [train.py:996] (0/4) Epoch 7, batch 7850, loss[loss=0.2057, simple_loss=0.2711, pruned_loss=0.07015, over 21294.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.2984, pruned_loss=0.07513, over 4258342.58 frames. ], batch size: 211, lr: 4.39e-03, grad_scale: 16.0 2023-06-24 17:09:54,316 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1145142.0, ans=0.125 2023-06-24 17:10:10,680 INFO [train.py:996] (0/4) Epoch 7, batch 7900, loss[loss=0.2401, simple_loss=0.3233, pruned_loss=0.0784, over 21698.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.2936, pruned_loss=0.07433, over 4252471.68 frames. ], batch size: 298, lr: 4.39e-03, grad_scale: 16.0 2023-06-24 17:10:13,334 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1145202.0, ans=0.0 2023-06-24 17:10:30,881 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.267e+02 2.893e+02 3.310e+02 4.075e+02 8.177e+02, threshold=6.621e+02, percent-clipped=1.0 2023-06-24 17:10:40,880 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1145262.0, ans=0.0 2023-06-24 17:11:14,834 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 17:11:53,339 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.64 vs. limit=22.5 2023-06-24 17:12:01,936 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1145502.0, ans=0.0 2023-06-24 17:12:02,927 INFO [train.py:996] (0/4) Epoch 7, batch 7950, loss[loss=0.2404, simple_loss=0.3059, pruned_loss=0.08749, over 20670.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3004, pruned_loss=0.07425, over 4260013.37 frames. ], batch size: 607, lr: 4.39e-03, grad_scale: 16.0 2023-06-24 17:12:05,381 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1145502.0, ans=0.07 2023-06-24 17:13:54,792 INFO [train.py:996] (0/4) Epoch 7, batch 8000, loss[loss=0.2035, simple_loss=0.3129, pruned_loss=0.04703, over 20850.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3053, pruned_loss=0.07629, over 4263644.27 frames. ], batch size: 609, lr: 4.39e-03, grad_scale: 32.0 2023-06-24 17:14:04,160 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.58 vs. 
limit=15.0 2023-06-24 17:14:22,383 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.090e+02 2.777e+02 3.258e+02 3.899e+02 6.990e+02, threshold=6.515e+02, percent-clipped=3.0 2023-06-24 17:15:06,094 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=1145922.0, ans=0.025 2023-06-24 17:15:27,796 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1145982.0, ans=0.0 2023-06-24 17:15:32,937 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1146042.0, ans=0.2 2023-06-24 17:15:57,254 INFO [train.py:996] (0/4) Epoch 7, batch 8050, loss[loss=0.2494, simple_loss=0.3379, pruned_loss=0.08042, over 21873.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3107, pruned_loss=0.07738, over 4269601.29 frames. ], batch size: 372, lr: 4.39e-03, grad_scale: 16.0 2023-06-24 17:16:46,505 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.10 vs. limit=15.0 2023-06-24 17:17:00,461 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.09 vs. limit=15.0 2023-06-24 17:17:48,335 INFO [train.py:996] (0/4) Epoch 7, batch 8100, loss[loss=0.2251, simple_loss=0.2938, pruned_loss=0.07823, over 21808.00 frames. ], tot_loss[loss=0.231, simple_loss=0.308, pruned_loss=0.077, over 4276064.05 frames. ], batch size: 247, lr: 4.39e-03, grad_scale: 16.0 2023-06-24 17:18:04,272 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1146402.0, ans=0.0 2023-06-24 17:18:05,826 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 17:18:21,656 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.184e+02 3.042e+02 3.840e+02 5.397e+02 9.623e+02, threshold=7.680e+02, percent-clipped=13.0 2023-06-24 17:18:45,650 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1146522.0, ans=0.125 2023-06-24 17:19:04,828 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.49 vs. limit=10.0 2023-06-24 17:19:23,390 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1146582.0, ans=0.1 2023-06-24 17:19:55,078 INFO [train.py:996] (0/4) Epoch 7, batch 8150, loss[loss=0.2557, simple_loss=0.3661, pruned_loss=0.07268, over 21583.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3157, pruned_loss=0.0795, over 4276669.61 frames. ], batch size: 389, lr: 4.39e-03, grad_scale: 16.0 2023-06-24 17:20:12,251 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.95 vs. 
limit=10.0 2023-06-24 17:20:22,511 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1146762.0, ans=0.125 2023-06-24 17:20:32,754 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1146822.0, ans=0.1 2023-06-24 17:20:44,757 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1146822.0, ans=0.1 2023-06-24 17:20:48,419 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1146882.0, ans=0.125 2023-06-24 17:21:44,385 INFO [train.py:996] (0/4) Epoch 7, batch 8200, loss[loss=0.1801, simple_loss=0.2479, pruned_loss=0.05613, over 21647.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3077, pruned_loss=0.0757, over 4261666.12 frames. ], batch size: 247, lr: 4.39e-03, grad_scale: 16.0 2023-06-24 17:21:52,965 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.74 vs. limit=15.0 2023-06-24 17:21:53,960 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1147002.0, ans=0.2 2023-06-24 17:21:54,475 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.12 vs. limit=22.5 2023-06-24 17:22:06,169 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.043e+02 2.961e+02 3.959e+02 5.617e+02 1.113e+03, threshold=7.919e+02, percent-clipped=3.0 2023-06-24 17:22:38,399 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1147182.0, ans=0.1 2023-06-24 17:22:54,688 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1147182.0, ans=0.125 2023-06-24 17:23:29,585 INFO [train.py:996] (0/4) Epoch 7, batch 8250, loss[loss=0.2287, simple_loss=0.3258, pruned_loss=0.06586, over 21734.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.3054, pruned_loss=0.07487, over 4267106.31 frames. ], batch size: 282, lr: 4.39e-03, grad_scale: 16.0 2023-06-24 17:24:17,817 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.82 vs. limit=22.5 2023-06-24 17:24:39,536 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1147482.0, ans=0.0 2023-06-24 17:25:01,333 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1147542.0, ans=0.125 2023-06-24 17:25:09,470 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1147542.0, ans=0.0 2023-06-24 17:25:22,782 INFO [train.py:996] (0/4) Epoch 7, batch 8300, loss[loss=0.1906, simple_loss=0.2693, pruned_loss=0.05598, over 21178.00 frames. ], tot_loss[loss=0.223, simple_loss=0.3017, pruned_loss=0.07212, over 4267407.30 frames. 
], batch size: 176, lr: 4.39e-03, grad_scale: 16.0 2023-06-24 17:25:43,625 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.968e+02 2.710e+02 3.107e+02 3.703e+02 5.803e+02, threshold=6.215e+02, percent-clipped=0.0 2023-06-24 17:27:04,520 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1147842.0, ans=0.125 2023-06-24 17:27:12,223 INFO [train.py:996] (0/4) Epoch 7, batch 8350, loss[loss=0.2101, simple_loss=0.2866, pruned_loss=0.06683, over 21870.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.3013, pruned_loss=0.07085, over 4265339.91 frames. ], batch size: 373, lr: 4.39e-03, grad_scale: 16.0 2023-06-24 17:28:19,833 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1148082.0, ans=0.125 2023-06-24 17:28:56,248 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.84 vs. limit=15.0 2023-06-24 17:29:03,705 INFO [train.py:996] (0/4) Epoch 7, batch 8400, loss[loss=0.2371, simple_loss=0.3627, pruned_loss=0.05576, over 20810.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2982, pruned_loss=0.06819, over 4268110.67 frames. ], batch size: 607, lr: 4.39e-03, grad_scale: 32.0 2023-06-24 17:29:25,438 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.909e+02 2.527e+02 3.220e+02 3.909e+02 1.035e+03, threshold=6.440e+02, percent-clipped=5.0 2023-06-24 17:29:59,837 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1148322.0, ans=0.0 2023-06-24 17:30:08,755 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1148382.0, ans=0.125 2023-06-24 17:30:16,563 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1148382.0, ans=0.07 2023-06-24 17:30:30,902 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1148442.0, ans=0.125 2023-06-24 17:30:38,150 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1148442.0, ans=0.025 2023-06-24 17:30:47,871 INFO [train.py:996] (0/4) Epoch 7, batch 8450, loss[loss=0.2237, simple_loss=0.3392, pruned_loss=0.05413, over 20873.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2976, pruned_loss=0.068, over 4275262.64 frames. ], batch size: 607, lr: 4.39e-03, grad_scale: 32.0 2023-06-24 17:31:06,129 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.30 vs. limit=22.5 2023-06-24 17:31:26,712 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1148622.0, ans=0.1 2023-06-24 17:32:36,622 INFO [train.py:996] (0/4) Epoch 7, batch 8500, loss[loss=0.2352, simple_loss=0.2825, pruned_loss=0.09397, over 21340.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.2928, pruned_loss=0.06919, over 4276529.41 frames. 
], batch size: 473, lr: 4.38e-03, grad_scale: 32.0 2023-06-24 17:32:57,150 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.311e+02 2.839e+02 3.413e+02 4.005e+02 7.078e+02, threshold=6.826e+02, percent-clipped=2.0 2023-06-24 17:33:06,919 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1148862.0, ans=0.125 2023-06-24 17:33:35,216 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1148922.0, ans=0.0 2023-06-24 17:33:37,760 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.77 vs. limit=22.5 2023-06-24 17:33:57,147 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.01 vs. limit=22.5 2023-06-24 17:34:26,835 INFO [train.py:996] (0/4) Epoch 7, batch 8550, loss[loss=0.2214, simple_loss=0.2934, pruned_loss=0.07469, over 21577.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2952, pruned_loss=0.07136, over 4273706.13 frames. ], batch size: 263, lr: 4.38e-03, grad_scale: 32.0 2023-06-24 17:35:44,249 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1149282.0, ans=0.125 2023-06-24 17:36:02,303 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1149342.0, ans=0.125 2023-06-24 17:36:03,894 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1149342.0, ans=0.125 2023-06-24 17:36:18,071 INFO [train.py:996] (0/4) Epoch 7, batch 8600, loss[loss=0.2637, simple_loss=0.3441, pruned_loss=0.0916, over 21384.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.3028, pruned_loss=0.07437, over 4267776.30 frames. ], batch size: 176, lr: 4.38e-03, grad_scale: 32.0 2023-06-24 17:36:24,476 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1149402.0, ans=0.125 2023-06-24 17:36:36,391 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.81 vs. limit=15.0 2023-06-24 17:36:40,532 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.213e+02 3.018e+02 3.698e+02 4.926e+02 7.683e+02, threshold=7.396e+02, percent-clipped=5.0 2023-06-24 17:37:31,226 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1149582.0, ans=0.2 2023-06-24 17:37:31,326 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1149582.0, ans=0.0 2023-06-24 17:37:48,786 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1149642.0, ans=0.125 2023-06-24 17:37:58,832 INFO [train.py:996] (0/4) Epoch 7, batch 8650, loss[loss=0.1707, simple_loss=0.2646, pruned_loss=0.03844, over 21584.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3081, pruned_loss=0.07504, over 4267782.79 frames. ], batch size: 230, lr: 4.38e-03, grad_scale: 32.0 2023-06-24 17:39:21,480 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.40 vs. 
limit=22.5 2023-06-24 17:39:31,296 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1149942.0, ans=0.125 2023-06-24 17:39:41,534 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1150002.0, ans=0.125 2023-06-24 17:39:42,724 INFO [train.py:996] (0/4) Epoch 7, batch 8700, loss[loss=0.2105, simple_loss=0.2749, pruned_loss=0.07311, over 21868.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.3006, pruned_loss=0.07228, over 4263766.59 frames. ], batch size: 373, lr: 4.38e-03, grad_scale: 16.0 2023-06-24 17:39:44,746 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1150002.0, ans=0.125 2023-06-24 17:39:46,277 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1150002.0, ans=0.0 2023-06-24 17:40:05,170 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1150062.0, ans=0.125 2023-06-24 17:40:09,728 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.692e+02 2.588e+02 3.028e+02 3.644e+02 6.697e+02, threshold=6.057e+02, percent-clipped=0.0 2023-06-24 17:40:31,661 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1150122.0, ans=0.0 2023-06-24 17:40:52,508 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1150122.0, ans=0.125 2023-06-24 17:41:23,098 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.96 vs. limit=10.0 2023-06-24 17:41:30,718 INFO [train.py:996] (0/4) Epoch 7, batch 8750, loss[loss=0.1943, simple_loss=0.262, pruned_loss=0.06337, over 21671.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.2975, pruned_loss=0.07231, over 4272787.03 frames. ], batch size: 230, lr: 4.38e-03, grad_scale: 16.0 2023-06-24 17:42:03,149 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1150362.0, ans=0.2 2023-06-24 17:42:04,629 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1150362.0, ans=0.125 2023-06-24 17:43:22,466 INFO [train.py:996] (0/4) Epoch 7, batch 8800, loss[loss=0.2912, simple_loss=0.3654, pruned_loss=0.1085, over 21835.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3068, pruned_loss=0.07523, over 4277816.17 frames. ], batch size: 118, lr: 4.38e-03, grad_scale: 32.0 2023-06-24 17:44:02,069 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.93 vs. 
limit=15.0 2023-06-24 17:44:02,359 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.260e+02 3.059e+02 3.780e+02 4.742e+02 8.855e+02, threshold=7.560e+02, percent-clipped=10.0 2023-06-24 17:44:04,751 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1150662.0, ans=0.07 2023-06-24 17:44:26,553 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1150722.0, ans=0.1 2023-06-24 17:45:15,959 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.83 vs. limit=6.0 2023-06-24 17:45:24,905 INFO [train.py:996] (0/4) Epoch 7, batch 8850, loss[loss=0.2311, simple_loss=0.3075, pruned_loss=0.07738, over 21710.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3126, pruned_loss=0.07746, over 4273606.02 frames. ], batch size: 124, lr: 4.38e-03, grad_scale: 32.0 2023-06-24 17:45:30,586 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff2.min_abs, batch_count=1150902.0, ans=0.1 2023-06-24 17:45:46,940 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.64 vs. limit=22.5 2023-06-24 17:45:46,994 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.37 vs. limit=15.0 2023-06-24 17:46:07,792 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1150962.0, ans=0.2 2023-06-24 17:46:11,282 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1151022.0, ans=0.125 2023-06-24 17:46:14,862 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1151022.0, ans=0.2 2023-06-24 17:46:18,632 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1151022.0, ans=0.125 2023-06-24 17:46:37,222 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.52 vs. limit=12.0 2023-06-24 17:47:16,892 INFO [train.py:996] (0/4) Epoch 7, batch 8900, loss[loss=0.2451, simple_loss=0.3015, pruned_loss=0.09432, over 20012.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3065, pruned_loss=0.07625, over 4274355.07 frames. 
], batch size: 702, lr: 4.38e-03, grad_scale: 16.0 2023-06-24 17:47:54,310 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.186e+02 2.946e+02 3.604e+02 5.046e+02 1.118e+03, threshold=7.207e+02, percent-clipped=3.0 2023-06-24 17:47:55,056 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1151262.0, ans=0.125 2023-06-24 17:48:15,419 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1151322.0, ans=0.125 2023-06-24 17:48:15,437 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1151322.0, ans=0.0 2023-06-24 17:48:20,654 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1151322.0, ans=0.1 2023-06-24 17:48:53,808 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1151442.0, ans=0.0 2023-06-24 17:49:21,206 INFO [train.py:996] (0/4) Epoch 7, batch 8950, loss[loss=0.1973, simple_loss=0.268, pruned_loss=0.0633, over 21630.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3052, pruned_loss=0.07527, over 4268542.04 frames. ], batch size: 247, lr: 4.38e-03, grad_scale: 16.0 2023-06-24 17:50:37,401 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1151682.0, ans=0.0 2023-06-24 17:50:39,511 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1151682.0, ans=0.0 2023-06-24 17:51:08,320 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.49 vs. limit=12.0 2023-06-24 17:51:10,374 INFO [train.py:996] (0/4) Epoch 7, batch 9000, loss[loss=0.1927, simple_loss=0.2427, pruned_loss=0.07134, over 21225.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.3011, pruned_loss=0.0756, over 4273066.05 frames. ], batch size: 159, lr: 4.38e-03, grad_scale: 16.0 2023-06-24 17:51:10,375 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-24 17:51:28,286 INFO [train.py:1028] (0/4) Epoch 7, validation: loss=0.2657, simple_loss=0.3576, pruned_loss=0.0869, over 1796401.00 frames. 2023-06-24 17:51:28,287 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23616MB 2023-06-24 17:51:41,898 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1151802.0, ans=0.0 2023-06-24 17:51:53,794 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.127e+02 2.929e+02 3.694e+02 4.955e+02 7.799e+02, threshold=7.388e+02, percent-clipped=3.0 2023-06-24 17:52:09,120 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1151922.0, ans=0.07 2023-06-24 17:52:37,618 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-192000.pt 2023-06-24 17:53:21,714 INFO [train.py:996] (0/4) Epoch 7, batch 9050, loss[loss=0.1911, simple_loss=0.2657, pruned_loss=0.05831, over 21286.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.2969, pruned_loss=0.07275, over 4262409.79 frames. 
], batch size: 551, lr: 4.38e-03, grad_scale: 16.0 2023-06-24 17:53:22,449 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1152102.0, ans=0.1 2023-06-24 17:54:53,625 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1152282.0, ans=0.0 2023-06-24 17:55:14,099 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=1152402.0, ans=10.0 2023-06-24 17:55:14,805 INFO [train.py:996] (0/4) Epoch 7, batch 9100, loss[loss=0.2118, simple_loss=0.3121, pruned_loss=0.05578, over 21741.00 frames. ], tot_loss[loss=0.225, simple_loss=0.3017, pruned_loss=0.0742, over 4266714.32 frames. ], batch size: 351, lr: 4.38e-03, grad_scale: 16.0 2023-06-24 17:55:45,275 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.881e+02 2.655e+02 3.193e+02 3.861e+02 6.275e+02, threshold=6.386e+02, percent-clipped=0.0 2023-06-24 17:56:11,112 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1152522.0, ans=0.125 2023-06-24 17:56:16,119 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1152522.0, ans=0.125 2023-06-24 17:56:41,890 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1152642.0, ans=0.5 2023-06-24 17:56:49,347 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1152642.0, ans=0.1 2023-06-24 17:56:58,431 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1152642.0, ans=0.125 2023-06-24 17:57:01,025 INFO [train.py:996] (0/4) Epoch 7, batch 9150, loss[loss=0.2285, simple_loss=0.3209, pruned_loss=0.06801, over 21767.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.3064, pruned_loss=0.07256, over 4274737.86 frames. ], batch size: 298, lr: 4.38e-03, grad_scale: 16.0 2023-06-24 17:57:19,443 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1152702.0, ans=0.1 2023-06-24 17:57:29,301 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.71 vs. limit=15.0 2023-06-24 17:58:44,885 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1152942.0, ans=0.125 2023-06-24 17:58:55,720 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1152942.0, ans=0.0 2023-06-24 17:58:58,734 INFO [train.py:996] (0/4) Epoch 7, batch 9200, loss[loss=0.2742, simple_loss=0.3448, pruned_loss=0.1018, over 21773.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3073, pruned_loss=0.07186, over 4276659.92 frames. ], batch size: 124, lr: 4.38e-03, grad_scale: 32.0 2023-06-24 17:59:29,540 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.106e+02 2.740e+02 3.426e+02 4.320e+02 8.569e+02, threshold=6.853e+02, percent-clipped=6.0 2023-06-24 18:00:13,322 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.71 vs. 
limit=8.0 2023-06-24 18:00:26,349 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1153242.0, ans=0.125 2023-06-24 18:00:37,608 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1153242.0, ans=0.1 2023-06-24 18:00:50,676 INFO [train.py:996] (0/4) Epoch 7, batch 9250, loss[loss=0.2505, simple_loss=0.3175, pruned_loss=0.09173, over 21706.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3092, pruned_loss=0.07432, over 4282259.76 frames. ], batch size: 298, lr: 4.38e-03, grad_scale: 32.0 2023-06-24 18:00:55,016 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 18:01:39,023 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.19 vs. limit=15.0 2023-06-24 18:02:36,069 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1153542.0, ans=0.125 2023-06-24 18:02:42,902 INFO [train.py:996] (0/4) Epoch 7, batch 9300, loss[loss=0.2237, simple_loss=0.3181, pruned_loss=0.0646, over 20706.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.3021, pruned_loss=0.07382, over 4276136.27 frames. ], batch size: 607, lr: 4.38e-03, grad_scale: 32.0 2023-06-24 18:03:05,724 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1153662.0, ans=0.1 2023-06-24 18:03:13,914 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.140e+02 3.058e+02 3.549e+02 4.364e+02 7.419e+02, threshold=7.098e+02, percent-clipped=2.0 2023-06-24 18:03:39,881 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.21 vs. limit=22.5 2023-06-24 18:04:28,645 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.70 vs. limit=6.0 2023-06-24 18:04:29,058 INFO [train.py:996] (0/4) Epoch 7, batch 9350, loss[loss=0.315, simple_loss=0.3667, pruned_loss=0.1317, over 21324.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3098, pruned_loss=0.0751, over 4278484.62 frames. ], batch size: 507, lr: 4.37e-03, grad_scale: 32.0 2023-06-24 18:05:08,231 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1153962.0, ans=0.125 2023-06-24 18:05:17,238 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1153962.0, ans=0.125 2023-06-24 18:05:43,205 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.09 vs. limit=22.5 2023-06-24 18:06:01,527 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1154142.0, ans=0.1 2023-06-24 18:06:31,745 INFO [train.py:996] (0/4) Epoch 7, batch 9400, loss[loss=0.2226, simple_loss=0.2904, pruned_loss=0.07743, over 21478.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3139, pruned_loss=0.07576, over 4265143.87 frames. 
], batch size: 389, lr: 4.37e-03, grad_scale: 32.0 2023-06-24 18:06:34,770 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1154202.0, ans=0.0 2023-06-24 18:07:02,154 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.141e+02 2.873e+02 3.280e+02 3.858e+02 8.681e+02, threshold=6.561e+02, percent-clipped=2.0 2023-06-24 18:07:09,082 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.86 vs. limit=10.0 2023-06-24 18:07:30,380 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=1154322.0, ans=15.0 2023-06-24 18:08:14,144 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.30 vs. limit=15.0 2023-06-24 18:08:21,787 INFO [train.py:996] (0/4) Epoch 7, batch 9450, loss[loss=0.176, simple_loss=0.2419, pruned_loss=0.05504, over 21558.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.3054, pruned_loss=0.07421, over 4265332.21 frames. ], batch size: 263, lr: 4.37e-03, grad_scale: 16.0 2023-06-24 18:08:34,022 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1154502.0, ans=0.0 2023-06-24 18:09:08,725 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1154622.0, ans=0.1 2023-06-24 18:09:14,720 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.75 vs. limit=15.0 2023-06-24 18:09:28,603 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1154682.0, ans=0.0 2023-06-24 18:09:37,123 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1154682.0, ans=0.125 2023-06-24 18:09:47,550 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1154742.0, ans=0.0 2023-06-24 18:10:03,125 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=1154802.0, ans=0.025 2023-06-24 18:10:10,157 INFO [train.py:996] (0/4) Epoch 7, batch 9500, loss[loss=0.2054, simple_loss=0.2654, pruned_loss=0.07269, over 21493.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.296, pruned_loss=0.07274, over 4262935.46 frames. ], batch size: 391, lr: 4.37e-03, grad_scale: 16.0 2023-06-24 18:10:40,170 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.23 vs. 
limit=22.5 2023-06-24 18:10:41,272 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1154862.0, ans=0.125 2023-06-24 18:10:42,232 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 2.886e+02 3.476e+02 4.165e+02 8.781e+02, threshold=6.953e+02, percent-clipped=4.0 2023-06-24 18:10:46,411 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1154862.0, ans=0.2 2023-06-24 18:11:44,624 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.41 vs. limit=15.0 2023-06-24 18:12:01,001 INFO [train.py:996] (0/4) Epoch 7, batch 9550, loss[loss=0.2572, simple_loss=0.3303, pruned_loss=0.09206, over 21600.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3018, pruned_loss=0.07473, over 4263724.94 frames. ], batch size: 389, lr: 4.37e-03, grad_scale: 16.0 2023-06-24 18:12:26,653 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1155162.0, ans=0.2 2023-06-24 18:12:36,323 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1155162.0, ans=0.1 2023-06-24 18:12:40,711 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.79 vs. limit=15.0 2023-06-24 18:12:43,931 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.32 vs. limit=22.5 2023-06-24 18:12:45,119 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1155222.0, ans=0.125 2023-06-24 18:12:54,026 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.30 vs. limit=6.0 2023-06-24 18:12:59,033 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1155222.0, ans=0.07 2023-06-24 18:13:50,418 INFO [train.py:996] (0/4) Epoch 7, batch 9600, loss[loss=0.2206, simple_loss=0.2853, pruned_loss=0.07798, over 21519.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3039, pruned_loss=0.0762, over 4270585.86 frames. ], batch size: 548, lr: 4.37e-03, grad_scale: 32.0 2023-06-24 18:14:23,109 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.186e+02 3.053e+02 3.563e+02 4.666e+02 8.626e+02, threshold=7.126e+02, percent-clipped=5.0 2023-06-24 18:15:45,074 INFO [train.py:996] (0/4) Epoch 7, batch 9650, loss[loss=0.2599, simple_loss=0.3744, pruned_loss=0.0727, over 20837.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.3029, pruned_loss=0.07582, over 4276084.02 frames. ], batch size: 608, lr: 4.37e-03, grad_scale: 16.0 2023-06-24 18:16:14,079 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1155762.0, ans=0.2 2023-06-24 18:16:18,538 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.26 vs. 
limit=15.0 2023-06-24 18:16:38,706 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1155822.0, ans=0.0 2023-06-24 18:17:34,788 INFO [train.py:996] (0/4) Epoch 7, batch 9700, loss[loss=0.2565, simple_loss=0.3195, pruned_loss=0.09675, over 21588.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3077, pruned_loss=0.07655, over 4270804.46 frames. ], batch size: 471, lr: 4.37e-03, grad_scale: 16.0 2023-06-24 18:17:44,571 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1156002.0, ans=0.125 2023-06-24 18:18:08,281 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.126e+02 2.706e+02 3.025e+02 3.673e+02 7.479e+02, threshold=6.049e+02, percent-clipped=1.0 2023-06-24 18:18:21,318 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1156122.0, ans=0.125 2023-06-24 18:18:25,496 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.59 vs. limit=6.0 2023-06-24 18:18:32,723 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.98 vs. limit=6.0 2023-06-24 18:18:49,360 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1156182.0, ans=0.0 2023-06-24 18:19:11,798 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1156242.0, ans=0.1 2023-06-24 18:19:18,101 INFO [train.py:996] (0/4) Epoch 7, batch 9750, loss[loss=0.2697, simple_loss=0.3528, pruned_loss=0.09337, over 21888.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.3017, pruned_loss=0.07566, over 4275330.00 frames. ], batch size: 107, lr: 4.37e-03, grad_scale: 16.0 2023-06-24 18:20:15,041 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1156482.0, ans=0.125 2023-06-24 18:20:21,930 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1156482.0, ans=0.125 2023-06-24 18:20:31,431 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1156482.0, ans=0.125 2023-06-24 18:20:40,103 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1156542.0, ans=0.0 2023-06-24 18:21:07,433 INFO [train.py:996] (0/4) Epoch 7, batch 9800, loss[loss=0.2064, simple_loss=0.2846, pruned_loss=0.06413, over 21943.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3018, pruned_loss=0.07576, over 4258682.64 frames. ], batch size: 316, lr: 4.37e-03, grad_scale: 16.0 2023-06-24 18:21:11,867 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1156602.0, ans=0.2 2023-06-24 18:21:39,989 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.170e+02 2.762e+02 3.059e+02 4.077e+02 6.018e+02, threshold=6.118e+02, percent-clipped=0.0 2023-06-24 18:21:52,267 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.23 vs. 
limit=6.0 2023-06-24 18:22:36,307 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 18:22:55,824 INFO [train.py:996] (0/4) Epoch 7, batch 9850, loss[loss=0.1896, simple_loss=0.2549, pruned_loss=0.0622, over 21686.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.2976, pruned_loss=0.07597, over 4260270.50 frames. ], batch size: 282, lr: 4.37e-03, grad_scale: 16.0 2023-06-24 18:23:04,510 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1156902.0, ans=0.0 2023-06-24 18:23:17,064 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.94 vs. limit=5.0 2023-06-24 18:23:21,749 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1156962.0, ans=0.125 2023-06-24 18:23:42,204 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1157022.0, ans=0.125 2023-06-24 18:24:38,522 INFO [train.py:996] (0/4) Epoch 7, batch 9900, loss[loss=0.2503, simple_loss=0.3263, pruned_loss=0.08714, over 21464.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2936, pruned_loss=0.07522, over 4243879.76 frames. ], batch size: 131, lr: 4.37e-03, grad_scale: 16.0 2023-06-24 18:24:44,298 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1157202.0, ans=0.125 2023-06-24 18:25:04,528 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.45 vs. limit=22.5 2023-06-24 18:25:12,376 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.163e+02 2.791e+02 3.369e+02 4.122e+02 6.726e+02, threshold=6.739e+02, percent-clipped=1.0 2023-06-24 18:25:14,451 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1157262.0, ans=0.1 2023-06-24 18:26:27,528 INFO [train.py:996] (0/4) Epoch 7, batch 9950, loss[loss=0.2411, simple_loss=0.3157, pruned_loss=0.08319, over 21561.00 frames. ], tot_loss[loss=0.225, simple_loss=0.2957, pruned_loss=0.0771, over 4251937.02 frames. ], batch size: 389, lr: 4.37e-03, grad_scale: 16.0 2023-06-24 18:27:11,488 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1157622.0, ans=0.0 2023-06-24 18:27:32,952 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1157682.0, ans=0.07 2023-06-24 18:28:15,628 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1157802.0, ans=0.125 2023-06-24 18:28:16,566 INFO [train.py:996] (0/4) Epoch 7, batch 10000, loss[loss=0.1924, simple_loss=0.2538, pruned_loss=0.06549, over 21556.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.2914, pruned_loss=0.07516, over 4248339.97 frames. 
], batch size: 263, lr: 4.37e-03, grad_scale: 32.0 2023-06-24 18:28:33,621 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1157862.0, ans=0.2 2023-06-24 18:28:40,447 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1157862.0, ans=0.0 2023-06-24 18:28:49,778 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.007e+02 2.643e+02 3.254e+02 4.440e+02 7.063e+02, threshold=6.507e+02, percent-clipped=1.0 2023-06-24 18:29:05,552 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.09 vs. limit=15.0 2023-06-24 18:29:55,405 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1158042.0, ans=0.0 2023-06-24 18:30:04,070 INFO [train.py:996] (0/4) Epoch 7, batch 10050, loss[loss=0.2372, simple_loss=0.3099, pruned_loss=0.08232, over 20691.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.2924, pruned_loss=0.07506, over 4258217.08 frames. ], batch size: 607, lr: 4.37e-03, grad_scale: 32.0 2023-06-24 18:30:11,789 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1158102.0, ans=0.125 2023-06-24 18:30:14,050 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.03 vs. limit=15.0 2023-06-24 18:30:33,921 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1158162.0, ans=0.125 2023-06-24 18:31:05,281 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1158222.0, ans=0.0 2023-06-24 18:31:27,225 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.46 vs. limit=22.5 2023-06-24 18:31:45,961 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1158342.0, ans=0.2 2023-06-24 18:32:01,197 INFO [train.py:996] (0/4) Epoch 7, batch 10100, loss[loss=0.2118, simple_loss=0.2963, pruned_loss=0.06368, over 21685.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2897, pruned_loss=0.07309, over 4257693.08 frames. ], batch size: 389, lr: 4.37e-03, grad_scale: 16.0 2023-06-24 18:32:19,012 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1158462.0, ans=0.0 2023-06-24 18:32:30,795 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.965e+02 2.650e+02 3.073e+02 3.822e+02 6.259e+02, threshold=6.145e+02, percent-clipped=0.0 2023-06-24 18:32:46,883 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.34 vs. limit=22.5 2023-06-24 18:33:48,266 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.89 vs. limit=15.0 2023-06-24 18:33:50,320 INFO [train.py:996] (0/4) Epoch 7, batch 10150, loss[loss=0.2045, simple_loss=0.2838, pruned_loss=0.0626, over 21784.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.2957, pruned_loss=0.07497, over 4264153.69 frames. 
], batch size: 118, lr: 4.37e-03, grad_scale: 8.0 2023-06-24 18:34:11,138 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1158762.0, ans=0.0 2023-06-24 18:34:12,959 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.56 vs. limit=15.0 2023-06-24 18:34:34,631 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.06 vs. limit=15.0 2023-06-24 18:35:36,341 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1158942.0, ans=0.05 2023-06-24 18:35:39,213 INFO [train.py:996] (0/4) Epoch 7, batch 10200, loss[loss=0.2503, simple_loss=0.3268, pruned_loss=0.0869, over 21509.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2948, pruned_loss=0.07312, over 4266874.56 frames. ], batch size: 441, lr: 4.37e-03, grad_scale: 8.0 2023-06-24 18:36:01,400 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 18:36:17,255 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.114e+02 2.567e+02 2.979e+02 3.564e+02 7.472e+02, threshold=5.959e+02, percent-clipped=1.0 2023-06-24 18:36:22,774 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1159122.0, ans=0.0 2023-06-24 18:36:58,136 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1159182.0, ans=0.125 2023-06-24 18:37:28,874 INFO [train.py:996] (0/4) Epoch 7, batch 10250, loss[loss=0.1619, simple_loss=0.2456, pruned_loss=0.03912, over 21464.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2893, pruned_loss=0.06775, over 4271936.86 frames. ], batch size: 212, lr: 4.36e-03, grad_scale: 8.0 2023-06-24 18:37:50,970 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1159362.0, ans=0.0 2023-06-24 18:38:06,255 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1159362.0, ans=0.125 2023-06-24 18:38:10,061 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1159422.0, ans=0.1 2023-06-24 18:38:10,162 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1159422.0, ans=0.0 2023-06-24 18:39:22,124 INFO [train.py:996] (0/4) Epoch 7, batch 10300, loss[loss=0.245, simple_loss=0.3244, pruned_loss=0.08278, over 21237.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2921, pruned_loss=0.06848, over 4266953.10 frames. 
], batch size: 159, lr: 4.36e-03, grad_scale: 8.0 2023-06-24 18:40:11,437 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.745e+02 2.687e+02 3.369e+02 4.671e+02 1.084e+03, threshold=6.737e+02, percent-clipped=9.0 2023-06-24 18:40:25,157 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1159722.0, ans=0.2 2023-06-24 18:40:29,964 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1159722.0, ans=0.2 2023-06-24 18:40:32,402 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1159782.0, ans=0.07 2023-06-24 18:40:48,987 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.00 vs. limit=6.0 2023-06-24 18:41:06,444 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1159842.0, ans=0.0 2023-06-24 18:41:13,592 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1159902.0, ans=0.125 2023-06-24 18:41:14,514 INFO [train.py:996] (0/4) Epoch 7, batch 10350, loss[loss=0.2164, simple_loss=0.2952, pruned_loss=0.06885, over 19973.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2944, pruned_loss=0.0698, over 4261214.32 frames. ], batch size: 703, lr: 4.36e-03, grad_scale: 8.0 2023-06-24 18:41:59,815 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1159962.0, ans=0.125 2023-06-24 18:42:38,583 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1160082.0, ans=0.125 2023-06-24 18:43:10,507 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.05 vs. limit=10.0 2023-06-24 18:43:11,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1160202.0, ans=0.0 2023-06-24 18:43:11,928 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1160202.0, ans=0.0 2023-06-24 18:43:12,830 INFO [train.py:996] (0/4) Epoch 7, batch 10400, loss[loss=0.2812, simple_loss=0.3428, pruned_loss=0.1098, over 21517.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2898, pruned_loss=0.06957, over 4261725.00 frames. 
], batch size: 509, lr: 4.36e-03, grad_scale: 16.0 2023-06-24 18:43:24,830 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1160202.0, ans=0.0 2023-06-24 18:43:38,492 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1160202.0, ans=0.125 2023-06-24 18:43:51,180 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1160262.0, ans=0.1 2023-06-24 18:43:56,280 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.199e+02 2.812e+02 3.590e+02 4.501e+02 9.958e+02, threshold=7.181e+02, percent-clipped=6.0 2023-06-24 18:44:38,298 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=1160382.0, ans=0.05 2023-06-24 18:45:05,782 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1160442.0, ans=0.125 2023-06-24 18:45:15,905 INFO [train.py:996] (0/4) Epoch 7, batch 10450, loss[loss=0.2586, simple_loss=0.3281, pruned_loss=0.09452, over 21334.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2932, pruned_loss=0.0709, over 4259134.35 frames. ], batch size: 549, lr: 4.36e-03, grad_scale: 16.0 2023-06-24 18:45:29,096 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1160502.0, ans=0.125 2023-06-24 18:45:30,886 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1160502.0, ans=0.0 2023-06-24 18:46:05,442 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1160622.0, ans=0.0 2023-06-24 18:46:59,680 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1160742.0, ans=0.0 2023-06-24 18:47:06,312 INFO [train.py:996] (0/4) Epoch 7, batch 10500, loss[loss=0.196, simple_loss=0.2811, pruned_loss=0.0554, over 21764.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2947, pruned_loss=0.07026, over 4255743.91 frames. ], batch size: 351, lr: 4.36e-03, grad_scale: 16.0 2023-06-24 18:47:43,179 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.152e+02 2.810e+02 3.423e+02 4.183e+02 6.636e+02, threshold=6.845e+02, percent-clipped=0.0 2023-06-24 18:48:29,943 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 18:48:54,927 INFO [train.py:996] (0/4) Epoch 7, batch 10550, loss[loss=0.1934, simple_loss=0.2618, pruned_loss=0.06246, over 21650.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2885, pruned_loss=0.06974, over 4259199.55 frames. 
], batch size: 333, lr: 4.36e-03, grad_scale: 16.0 2023-06-24 18:48:55,372 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1161102.0, ans=0.125 2023-06-24 18:49:07,947 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1161102.0, ans=0.1 2023-06-24 18:49:12,679 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1161162.0, ans=0.0 2023-06-24 18:49:19,837 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 18:50:32,795 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1161342.0, ans=0.125 2023-06-24 18:50:46,834 INFO [train.py:996] (0/4) Epoch 7, batch 10600, loss[loss=0.1942, simple_loss=0.2688, pruned_loss=0.05982, over 21784.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2847, pruned_loss=0.06857, over 4255412.78 frames. ], batch size: 351, lr: 4.36e-03, grad_scale: 16.0 2023-06-24 18:50:50,030 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1161402.0, ans=0.0 2023-06-24 18:51:24,990 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.905e+02 2.607e+02 2.934e+02 3.561e+02 5.999e+02, threshold=5.868e+02, percent-clipped=0.0 2023-06-24 18:51:29,248 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1161522.0, ans=0.125 2023-06-24 18:51:38,470 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1161522.0, ans=0.1 2023-06-24 18:51:51,788 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1161582.0, ans=0.0 2023-06-24 18:52:02,540 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1161582.0, ans=0.125 2023-06-24 18:52:38,846 INFO [train.py:996] (0/4) Epoch 7, batch 10650, loss[loss=0.2688, simple_loss=0.3539, pruned_loss=0.09181, over 21548.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.2877, pruned_loss=0.06792, over 4256933.06 frames. ], batch size: 471, lr: 4.36e-03, grad_scale: 16.0 2023-06-24 18:52:44,404 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1161702.0, ans=0.0 2023-06-24 18:52:47,961 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1161702.0, ans=0.125 2023-06-24 18:53:07,286 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 18:54:22,980 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1161942.0, ans=0.1 2023-06-24 18:54:29,879 INFO [train.py:996] (0/4) Epoch 7, batch 10700, loss[loss=0.3066, simple_loss=0.3586, pruned_loss=0.1273, over 21327.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.2879, pruned_loss=0.06783, over 4251995.79 frames. 
], batch size: 507, lr: 4.36e-03, grad_scale: 16.0 2023-06-24 18:54:33,678 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1162002.0, ans=0.2 2023-06-24 18:55:08,598 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.005e+02 2.935e+02 3.419e+02 4.511e+02 9.695e+02, threshold=6.839e+02, percent-clipped=12.0 2023-06-24 18:55:41,991 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1162182.0, ans=0.2 2023-06-24 18:55:44,120 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.65 vs. limit=22.5 2023-06-24 18:56:29,685 INFO [train.py:996] (0/4) Epoch 7, batch 10750, loss[loss=0.2368, simple_loss=0.3141, pruned_loss=0.07977, over 21320.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.2989, pruned_loss=0.07224, over 4252377.51 frames. ], batch size: 176, lr: 4.36e-03, grad_scale: 16.0 2023-06-24 18:56:50,107 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1162362.0, ans=0.125 2023-06-24 18:57:50,687 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1162482.0, ans=0.125 2023-06-24 18:58:17,100 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1162542.0, ans=0.125 2023-06-24 18:58:21,621 INFO [train.py:996] (0/4) Epoch 7, batch 10800, loss[loss=0.2401, simple_loss=0.3163, pruned_loss=0.08192, over 21811.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.3015, pruned_loss=0.0719, over 4258396.34 frames. ], batch size: 282, lr: 4.36e-03, grad_scale: 32.0 2023-06-24 18:58:53,271 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1162662.0, ans=0.2 2023-06-24 18:59:03,757 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1162662.0, ans=0.035 2023-06-24 18:59:03,786 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1162662.0, ans=0.125 2023-06-24 18:59:06,337 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.283e+02 2.815e+02 3.156e+02 3.825e+02 7.344e+02, threshold=6.312e+02, percent-clipped=1.0 2023-06-24 19:00:03,847 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1162842.0, ans=0.125 2023-06-24 19:00:07,153 INFO [train.py:996] (0/4) Epoch 7, batch 10850, loss[loss=0.193, simple_loss=0.2484, pruned_loss=0.06881, over 20776.00 frames. ], tot_loss[loss=0.224, simple_loss=0.3026, pruned_loss=0.07272, over 4262006.28 frames. 
], batch size: 609, lr: 4.36e-03, grad_scale: 16.0 2023-06-24 19:00:39,929 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1162962.0, ans=0.1 2023-06-24 19:01:05,111 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1163022.0, ans=0.125 2023-06-24 19:01:15,781 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1163082.0, ans=0.125 2023-06-24 19:01:59,667 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1163142.0, ans=0.0 2023-06-24 19:02:02,602 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1163202.0, ans=0.1 2023-06-24 19:02:04,080 INFO [train.py:996] (0/4) Epoch 7, batch 10900, loss[loss=0.2018, simple_loss=0.285, pruned_loss=0.05927, over 21222.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.2973, pruned_loss=0.07122, over 4247320.25 frames. ], batch size: 176, lr: 4.36e-03, grad_scale: 16.0 2023-06-24 19:02:26,108 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1163262.0, ans=0.0 2023-06-24 19:02:45,075 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1163262.0, ans=0.125 2023-06-24 19:02:47,849 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.963e+02 2.711e+02 3.083e+02 3.861e+02 1.043e+03, threshold=6.166e+02, percent-clipped=5.0 2023-06-24 19:03:43,160 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1163442.0, ans=10.0 2023-06-24 19:03:53,387 INFO [train.py:996] (0/4) Epoch 7, batch 10950, loss[loss=0.2074, simple_loss=0.2712, pruned_loss=0.07185, over 21708.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2939, pruned_loss=0.06949, over 4237348.13 frames. ], batch size: 300, lr: 4.36e-03, grad_scale: 16.0 2023-06-24 19:04:15,764 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1163562.0, ans=0.125 2023-06-24 19:04:21,491 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1163562.0, ans=0.125 2023-06-24 19:04:47,478 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1163622.0, ans=0.07 2023-06-24 19:05:15,591 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1163682.0, ans=0.125 2023-06-24 19:05:16,150 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.90 vs. limit=12.0 2023-06-24 19:05:42,608 INFO [train.py:996] (0/4) Epoch 7, batch 11000, loss[loss=0.2204, simple_loss=0.287, pruned_loss=0.07692, over 21951.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2931, pruned_loss=0.0702, over 4246436.56 frames. 
], batch size: 316, lr: 4.36e-03, grad_scale: 16.0 2023-06-24 19:05:47,032 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1163802.0, ans=0.125 2023-06-24 19:06:02,522 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1163802.0, ans=0.125 2023-06-24 19:06:16,926 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.60 vs. limit=15.0 2023-06-24 19:06:23,774 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1163862.0, ans=0.125 2023-06-24 19:06:26,186 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.210e+02 2.764e+02 3.110e+02 3.886e+02 6.584e+02, threshold=6.221e+02, percent-clipped=1.0 2023-06-24 19:06:39,108 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1163922.0, ans=0.125 2023-06-24 19:06:39,139 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1163922.0, ans=0.2 2023-06-24 19:06:40,730 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1163922.0, ans=0.0 2023-06-24 19:07:31,776 INFO [train.py:996] (0/4) Epoch 7, batch 11050, loss[loss=0.237, simple_loss=0.3021, pruned_loss=0.086, over 20650.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2906, pruned_loss=0.07168, over 4248515.13 frames. ], batch size: 607, lr: 4.36e-03, grad_scale: 16.0 2023-06-24 19:09:17,968 INFO [train.py:996] (0/4) Epoch 7, batch 11100, loss[loss=0.1741, simple_loss=0.2527, pruned_loss=0.04774, over 15366.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.289, pruned_loss=0.07202, over 4248184.15 frames. ], batch size: 60, lr: 4.36e-03, grad_scale: 16.0 2023-06-24 19:09:28,391 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1164402.0, ans=0.125 2023-06-24 19:09:31,883 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1164402.0, ans=0.125 2023-06-24 19:09:42,719 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.21 vs. limit=15.0 2023-06-24 19:10:00,567 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.194e+02 2.678e+02 3.103e+02 3.561e+02 5.692e+02, threshold=6.205e+02, percent-clipped=0.0 2023-06-24 19:10:03,481 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.87 vs. limit=10.0 2023-06-24 19:10:04,967 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1164522.0, ans=0.1 2023-06-24 19:11:02,494 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1164642.0, ans=0.0 2023-06-24 19:11:05,130 INFO [train.py:996] (0/4) Epoch 7, batch 11150, loss[loss=0.2102, simple_loss=0.3069, pruned_loss=0.05677, over 21778.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2878, pruned_loss=0.07068, over 4241366.94 frames. 
], batch size: 282, lr: 4.35e-03, grad_scale: 16.0 2023-06-24 19:11:21,441 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1164762.0, ans=0.125 2023-06-24 19:11:57,793 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.70 vs. limit=15.0 2023-06-24 19:12:09,323 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1164882.0, ans=0.125 2023-06-24 19:12:52,262 INFO [train.py:996] (0/4) Epoch 7, batch 11200, loss[loss=0.1912, simple_loss=0.252, pruned_loss=0.06524, over 21615.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2859, pruned_loss=0.0705, over 4247041.28 frames. ], batch size: 247, lr: 4.35e-03, grad_scale: 32.0 2023-06-24 19:13:31,529 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.10 vs. limit=10.0 2023-06-24 19:13:35,982 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.061e+02 2.570e+02 2.865e+02 3.266e+02 5.455e+02, threshold=5.730e+02, percent-clipped=0.0 2023-06-24 19:14:41,015 INFO [train.py:996] (0/4) Epoch 7, batch 11250, loss[loss=0.2037, simple_loss=0.2965, pruned_loss=0.05549, over 21626.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.285, pruned_loss=0.07023, over 4247089.24 frames. ], batch size: 230, lr: 4.35e-03, grad_scale: 32.0 2023-06-24 19:14:43,179 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1165302.0, ans=0.125 2023-06-24 19:14:52,868 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.40 vs. limit=22.5 2023-06-24 19:15:25,420 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.35 vs. limit=15.0 2023-06-24 19:15:26,327 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1165422.0, ans=0.125 2023-06-24 19:15:45,624 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1165482.0, ans=0.125 2023-06-24 19:16:29,535 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1165602.0, ans=0.025 2023-06-24 19:16:31,013 INFO [train.py:996] (0/4) Epoch 7, batch 11300, loss[loss=0.2034, simple_loss=0.2724, pruned_loss=0.0672, over 21786.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2863, pruned_loss=0.07089, over 4251957.77 frames. ], batch size: 247, lr: 4.35e-03, grad_scale: 32.0 2023-06-24 19:17:13,935 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.273e+02 2.821e+02 3.305e+02 4.579e+02 7.835e+02, threshold=6.611e+02, percent-clipped=6.0 2023-06-24 19:17:36,132 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1165782.0, ans=0.125 2023-06-24 19:18:19,935 INFO [train.py:996] (0/4) Epoch 7, batch 11350, loss[loss=0.215, simple_loss=0.2992, pruned_loss=0.06542, over 21427.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2878, pruned_loss=0.07038, over 4258071.95 frames. 
], batch size: 194, lr: 4.35e-03, grad_scale: 32.0 2023-06-24 19:18:53,635 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.43 vs. limit=15.0 2023-06-24 19:18:54,511 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1165962.0, ans=0.125 2023-06-24 19:19:03,608 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.26 vs. limit=15.0 2023-06-24 19:19:07,150 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1166022.0, ans=0.2 2023-06-24 19:19:12,161 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1166022.0, ans=0.0 2023-06-24 19:19:59,866 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1166142.0, ans=0.1 2023-06-24 19:20:11,170 INFO [train.py:996] (0/4) Epoch 7, batch 11400, loss[loss=0.2135, simple_loss=0.2988, pruned_loss=0.0641, over 21610.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2935, pruned_loss=0.07262, over 4264399.37 frames. ], batch size: 230, lr: 4.35e-03, grad_scale: 16.0 2023-06-24 19:20:56,087 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.109e+02 2.882e+02 3.810e+02 4.991e+02 7.494e+02, threshold=7.619e+02, percent-clipped=6.0 2023-06-24 19:20:56,505 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1166322.0, ans=0.0 2023-06-24 19:21:05,806 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1166322.0, ans=0.125 2023-06-24 19:21:16,956 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.55 vs. limit=6.0 2023-06-24 19:21:25,857 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1166382.0, ans=0.0 2023-06-24 19:22:06,443 INFO [train.py:996] (0/4) Epoch 7, batch 11450, loss[loss=0.1912, simple_loss=0.2759, pruned_loss=0.05323, over 21865.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2925, pruned_loss=0.07087, over 4265894.06 frames. ], batch size: 316, lr: 4.35e-03, grad_scale: 16.0 2023-06-24 19:22:29,968 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1166562.0, ans=0.125 2023-06-24 19:22:32,968 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1166562.0, ans=0.0 2023-06-24 19:23:07,037 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1166622.0, ans=0.125 2023-06-24 19:23:19,319 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1166682.0, ans=0.125 2023-06-24 19:23:59,144 INFO [train.py:996] (0/4) Epoch 7, batch 11500, loss[loss=0.1837, simple_loss=0.282, pruned_loss=0.04274, over 21780.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.2973, pruned_loss=0.07255, over 4269027.33 frames. 
], batch size: 282, lr: 4.35e-03, grad_scale: 16.0 2023-06-24 19:24:42,429 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1166862.0, ans=0.125 2023-06-24 19:24:44,986 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.002e+02 2.827e+02 3.371e+02 4.045e+02 6.932e+02, threshold=6.743e+02, percent-clipped=0.0 2023-06-24 19:25:33,717 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1167042.0, ans=0.125 2023-06-24 19:25:56,896 INFO [train.py:996] (0/4) Epoch 7, batch 11550, loss[loss=0.2356, simple_loss=0.333, pruned_loss=0.06909, over 21884.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.3027, pruned_loss=0.07241, over 4267384.94 frames. ], batch size: 316, lr: 4.35e-03, grad_scale: 16.0 2023-06-24 19:26:55,079 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1167222.0, ans=0.125 2023-06-24 19:27:05,164 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.91 vs. limit=15.0 2023-06-24 19:27:37,729 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1167342.0, ans=0.05 2023-06-24 19:27:48,862 INFO [train.py:996] (0/4) Epoch 7, batch 11600, loss[loss=0.2362, simple_loss=0.3439, pruned_loss=0.06424, over 21798.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3182, pruned_loss=0.07506, over 4273612.89 frames. ], batch size: 282, lr: 4.35e-03, grad_scale: 32.0 2023-06-24 19:28:03,771 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 19:28:05,367 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1167402.0, ans=0.0 2023-06-24 19:28:07,011 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1167402.0, ans=0.04949747468305833 2023-06-24 19:28:07,071 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1167402.0, ans=0.125 2023-06-24 19:28:24,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1167462.0, ans=0.0 2023-06-24 19:28:34,721 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.256e+02 2.839e+02 3.611e+02 4.809e+02 8.575e+02, threshold=7.221e+02, percent-clipped=4.0 2023-06-24 19:28:54,849 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1167522.0, ans=0.1 2023-06-24 19:29:42,827 INFO [train.py:996] (0/4) Epoch 7, batch 11650, loss[loss=0.2326, simple_loss=0.31, pruned_loss=0.07754, over 21374.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3249, pruned_loss=0.07566, over 4269501.63 frames. ], batch size: 194, lr: 4.35e-03, grad_scale: 32.0 2023-06-24 19:29:47,035 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1167702.0, ans=0.1 2023-06-24 19:29:54,271 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.80 vs. 
limit=15.0 2023-06-24 19:29:59,259 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1167762.0, ans=0.0 2023-06-24 19:30:01,245 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1167762.0, ans=0.125 2023-06-24 19:30:37,628 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1167822.0, ans=0.035 2023-06-24 19:30:46,463 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.38 vs. limit=15.0 2023-06-24 19:31:06,535 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1167882.0, ans=0.0 2023-06-24 19:31:33,862 INFO [train.py:996] (0/4) Epoch 7, batch 11700, loss[loss=0.1941, simple_loss=0.2623, pruned_loss=0.06296, over 21605.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3168, pruned_loss=0.07577, over 4265211.31 frames. ], batch size: 298, lr: 4.35e-03, grad_scale: 32.0 2023-06-24 19:32:16,016 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.211e+02 2.666e+02 3.050e+02 3.571e+02 8.433e+02, threshold=6.100e+02, percent-clipped=2.0 2023-06-24 19:33:20,119 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.90 vs. limit=6.0 2023-06-24 19:33:22,090 INFO [train.py:996] (0/4) Epoch 7, batch 11750, loss[loss=0.2373, simple_loss=0.3075, pruned_loss=0.08352, over 21811.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3073, pruned_loss=0.0747, over 4263165.92 frames. ], batch size: 372, lr: 4.35e-03, grad_scale: 16.0 2023-06-24 19:33:59,748 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.67 vs. limit=15.0 2023-06-24 19:34:00,784 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1168362.0, ans=0.125 2023-06-24 19:34:45,003 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1168482.0, ans=0.125 2023-06-24 19:35:14,541 INFO [train.py:996] (0/4) Epoch 7, batch 11800, loss[loss=0.2809, simple_loss=0.3561, pruned_loss=0.1029, over 21421.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3081, pruned_loss=0.07686, over 4258817.47 frames. ], batch size: 471, lr: 4.35e-03, grad_scale: 16.0 2023-06-24 19:35:23,252 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.48 vs. limit=15.0 2023-06-24 19:36:03,574 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.166e+02 2.959e+02 3.685e+02 4.448e+02 7.783e+02, threshold=7.370e+02, percent-clipped=3.0 2023-06-24 19:36:06,554 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.76 vs. limit=15.0 2023-06-24 19:37:04,943 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1168902.0, ans=0.1 2023-06-24 19:37:05,808 INFO [train.py:996] (0/4) Epoch 7, batch 11850, loss[loss=0.2432, simple_loss=0.3644, pruned_loss=0.06103, over 19785.00 frames. 
], tot_loss[loss=0.2294, simple_loss=0.3085, pruned_loss=0.07514, over 4259252.48 frames. ], batch size: 702, lr: 4.35e-03, grad_scale: 16.0 2023-06-24 19:37:38,695 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1168962.0, ans=0.0 2023-06-24 19:38:06,987 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1169022.0, ans=0.0 2023-06-24 19:38:10,843 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1169022.0, ans=0.1 2023-06-24 19:39:02,913 INFO [train.py:996] (0/4) Epoch 7, batch 11900, loss[loss=0.1979, simple_loss=0.2862, pruned_loss=0.05478, over 21836.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.309, pruned_loss=0.07281, over 4258778.80 frames. ], batch size: 316, lr: 4.35e-03, grad_scale: 16.0 2023-06-24 19:39:21,033 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1169202.0, ans=0.125 2023-06-24 19:39:51,155 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.045e+02 2.709e+02 3.163e+02 3.879e+02 8.042e+02, threshold=6.325e+02, percent-clipped=1.0 2023-06-24 19:39:52,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1169322.0, ans=0.1 2023-06-24 19:39:55,498 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1169322.0, ans=0.125 2023-06-24 19:40:29,009 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1169442.0, ans=0.125 2023-06-24 19:40:45,006 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1169442.0, ans=0.125 2023-06-24 19:40:58,219 INFO [train.py:996] (0/4) Epoch 7, batch 11950, loss[loss=0.2422, simple_loss=0.3345, pruned_loss=0.07494, over 21448.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.3099, pruned_loss=0.07029, over 4264703.80 frames. ], batch size: 507, lr: 4.35e-03, grad_scale: 16.0 2023-06-24 19:41:44,127 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.24 vs. limit=15.0 2023-06-24 19:42:40,543 INFO [train.py:996] (0/4) Epoch 7, batch 12000, loss[loss=0.2078, simple_loss=0.2662, pruned_loss=0.07465, over 21202.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.3041, pruned_loss=0.06905, over 4243147.78 frames. ], batch size: 144, lr: 4.34e-03, grad_scale: 32.0 2023-06-24 19:42:40,544 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-24 19:43:01,778 INFO [train.py:1028] (0/4) Epoch 7, validation: loss=0.261, simple_loss=0.3543, pruned_loss=0.08379, over 1796401.00 frames. 2023-06-24 19:43:01,780 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23616MB 2023-06-24 19:43:28,604 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.21 vs. 
limit=15.0 2023-06-24 19:43:34,790 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1169862.0, ans=0.0 2023-06-24 19:43:44,293 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.005e+02 2.826e+02 3.232e+02 4.022e+02 5.951e+02, threshold=6.465e+02, percent-clipped=0.0 2023-06-24 19:43:44,953 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1169922.0, ans=0.2 2023-06-24 19:43:45,562 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.79 vs. limit=22.5 2023-06-24 19:43:48,959 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.45 vs. limit=15.0 2023-06-24 19:44:19,846 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1169982.0, ans=0.1 2023-06-24 19:44:37,793 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1170042.0, ans=0.125 2023-06-24 19:44:56,213 INFO [train.py:996] (0/4) Epoch 7, batch 12050, loss[loss=0.2585, simple_loss=0.3293, pruned_loss=0.09383, over 21892.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2991, pruned_loss=0.07053, over 4254501.89 frames. ], batch size: 118, lr: 4.34e-03, grad_scale: 16.0 2023-06-24 19:45:14,629 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1170162.0, ans=0.07 2023-06-24 19:45:45,056 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1170222.0, ans=0.0 2023-06-24 19:46:12,365 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1170282.0, ans=0.125 2023-06-24 19:46:41,794 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.68 vs. limit=22.5 2023-06-24 19:46:48,313 INFO [train.py:996] (0/4) Epoch 7, batch 12100, loss[loss=0.2405, simple_loss=0.3112, pruned_loss=0.08491, over 21756.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3036, pruned_loss=0.07485, over 4262091.79 frames. ], batch size: 351, lr: 4.34e-03, grad_scale: 16.0 2023-06-24 19:47:05,209 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1170462.0, ans=0.125 2023-06-24 19:47:24,364 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.76 vs. limit=15.0 2023-06-24 19:47:33,918 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.129e+02 3.018e+02 3.555e+02 4.988e+02 8.352e+02, threshold=7.110e+02, percent-clipped=5.0 2023-06-24 19:48:07,162 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.67 vs. limit=10.0 2023-06-24 19:48:08,751 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.00 vs. 
limit=6.0 2023-06-24 19:48:36,469 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1170642.0, ans=0.0 2023-06-24 19:48:41,167 INFO [train.py:996] (0/4) Epoch 7, batch 12150, loss[loss=0.2, simple_loss=0.2904, pruned_loss=0.05481, over 21729.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.3067, pruned_loss=0.07393, over 4267723.44 frames. ], batch size: 247, lr: 4.34e-03, grad_scale: 16.0 2023-06-24 19:48:57,929 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1170702.0, ans=0.125 2023-06-24 19:48:58,014 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1170702.0, ans=0.2 2023-06-24 19:49:18,199 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1170762.0, ans=0.125 2023-06-24 19:49:26,593 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1170822.0, ans=0.0 2023-06-24 19:49:52,850 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1170882.0, ans=0.125 2023-06-24 19:49:52,863 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1170882.0, ans=0.0 2023-06-24 19:50:19,753 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1170942.0, ans=0.125 2023-06-24 19:50:30,886 INFO [train.py:996] (0/4) Epoch 7, batch 12200, loss[loss=0.2145, simple_loss=0.2768, pruned_loss=0.07614, over 21222.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.302, pruned_loss=0.07321, over 4262813.35 frames. ], batch size: 160, lr: 4.34e-03, grad_scale: 16.0 2023-06-24 19:50:45,304 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1171002.0, ans=0.1 2023-06-24 19:50:52,466 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1171062.0, ans=0.125 2023-06-24 19:51:02,865 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1171062.0, ans=0.0 2023-06-24 19:51:25,416 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.212e+02 3.032e+02 3.828e+02 4.856e+02 1.056e+03, threshold=7.657e+02, percent-clipped=7.0 2023-06-24 19:51:41,337 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1171182.0, ans=0.125 2023-06-24 19:52:06,124 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=1171242.0, ans=22.5 2023-06-24 19:52:09,260 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.90 vs. limit=6.0 2023-06-24 19:52:18,168 INFO [train.py:996] (0/4) Epoch 7, batch 12250, loss[loss=0.1869, simple_loss=0.2736, pruned_loss=0.05012, over 21678.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2949, pruned_loss=0.0702, over 4271083.66 frames. 
], batch size: 391, lr: 4.34e-03, grad_scale: 16.0 2023-06-24 19:52:59,688 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1171422.0, ans=0.0 2023-06-24 19:54:06,967 INFO [train.py:996] (0/4) Epoch 7, batch 12300, loss[loss=0.2542, simple_loss=0.3441, pruned_loss=0.0821, over 21695.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2877, pruned_loss=0.06509, over 4276158.99 frames. ], batch size: 414, lr: 4.34e-03, grad_scale: 16.0 2023-06-24 19:54:23,228 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1171602.0, ans=10.0 2023-06-24 19:54:56,066 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.637e+02 2.162e+02 2.543e+02 3.041e+02 6.823e+02, threshold=5.086e+02, percent-clipped=0.0 2023-06-24 19:55:17,267 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1171782.0, ans=0.125 2023-06-24 19:55:48,540 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1171842.0, ans=0.125 2023-06-24 19:55:54,580 INFO [train.py:996] (0/4) Epoch 7, batch 12350, loss[loss=0.2354, simple_loss=0.3021, pruned_loss=0.08436, over 20822.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2947, pruned_loss=0.06724, over 4277735.97 frames. ], batch size: 608, lr: 4.34e-03, grad_scale: 16.0 2023-06-24 19:55:55,233 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1171902.0, ans=0.2 2023-06-24 19:55:56,945 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1171902.0, ans=0.125 2023-06-24 19:56:21,445 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.84 vs. limit=15.0 2023-06-24 19:56:30,675 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1171962.0, ans=0.0 2023-06-24 19:56:30,711 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1171962.0, ans=0.0 2023-06-24 19:56:45,158 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1172022.0, ans=0.035 2023-06-24 19:57:42,412 INFO [train.py:996] (0/4) Epoch 7, batch 12400, loss[loss=0.2145, simple_loss=0.2847, pruned_loss=0.07211, over 21804.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2974, pruned_loss=0.07038, over 4282955.01 frames. ], batch size: 247, lr: 4.34e-03, grad_scale: 32.0 2023-06-24 19:58:37,883 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.337e+02 2.786e+02 3.157e+02 3.873e+02 7.298e+02, threshold=6.314e+02, percent-clipped=10.0 2023-06-24 19:58:50,622 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.46 vs. 
limit=6.0 2023-06-24 19:59:00,224 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1172382.0, ans=0.0 2023-06-24 19:59:05,885 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1172382.0, ans=0.125 2023-06-24 19:59:33,082 INFO [train.py:996] (0/4) Epoch 7, batch 12450, loss[loss=0.2701, simple_loss=0.3387, pruned_loss=0.1007, over 21256.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.3007, pruned_loss=0.07305, over 4284881.51 frames. ], batch size: 143, lr: 4.34e-03, grad_scale: 32.0 2023-06-24 19:59:54,562 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.55 vs. limit=15.0 2023-06-24 20:00:06,536 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1172562.0, ans=0.125 2023-06-24 20:00:18,419 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1172562.0, ans=0.0 2023-06-24 20:00:19,950 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1172562.0, ans=0.2 2023-06-24 20:00:23,579 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1172562.0, ans=0.125 2023-06-24 20:00:56,938 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1172682.0, ans=0.05 2023-06-24 20:01:23,653 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.73 vs. limit=22.5 2023-06-24 20:01:30,069 INFO [train.py:996] (0/4) Epoch 7, batch 12500, loss[loss=0.2746, simple_loss=0.3588, pruned_loss=0.09517, over 21231.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3109, pruned_loss=0.07598, over 4285954.68 frames. ], batch size: 143, lr: 4.34e-03, grad_scale: 32.0 2023-06-24 20:01:42,886 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1172802.0, ans=0.0 2023-06-24 20:02:05,911 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=1172862.0, ans=0.05 2023-06-24 20:02:19,539 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1172922.0, ans=0.125 2023-06-24 20:02:24,571 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.454e+02 3.093e+02 3.470e+02 4.423e+02 7.018e+02, threshold=6.940e+02, percent-clipped=1.0 2023-06-24 20:03:31,028 INFO [train.py:996] (0/4) Epoch 7, batch 12550, loss[loss=0.2312, simple_loss=0.3122, pruned_loss=0.07507, over 21609.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3144, pruned_loss=0.07775, over 4282009.97 frames. 
], batch size: 389, lr: 4.34e-03, grad_scale: 32.0 2023-06-24 20:04:08,458 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1173162.0, ans=0.125 2023-06-24 20:04:10,619 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1173222.0, ans=0.125 2023-06-24 20:04:13,580 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1173222.0, ans=0.2 2023-06-24 20:04:40,483 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1173282.0, ans=0.0 2023-06-24 20:05:21,012 INFO [train.py:996] (0/4) Epoch 7, batch 12600, loss[loss=0.1921, simple_loss=0.2791, pruned_loss=0.05261, over 21613.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3121, pruned_loss=0.07563, over 4267295.63 frames. ], batch size: 230, lr: 4.34e-03, grad_scale: 32.0 2023-06-24 20:05:21,549 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1173402.0, ans=0.0 2023-06-24 20:05:38,705 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1173402.0, ans=0.0 2023-06-24 20:06:05,573 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.874e+02 2.821e+02 3.460e+02 4.531e+02 8.641e+02, threshold=6.920e+02, percent-clipped=2.0 2023-06-24 20:06:42,881 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1173582.0, ans=0.125 2023-06-24 20:07:13,627 INFO [train.py:996] (0/4) Epoch 7, batch 12650, loss[loss=0.2433, simple_loss=0.3021, pruned_loss=0.09224, over 21335.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.3038, pruned_loss=0.07155, over 4269598.23 frames. ], batch size: 159, lr: 4.34e-03, grad_scale: 16.0 2023-06-24 20:07:25,990 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1173702.0, ans=0.1 2023-06-24 20:07:58,686 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1173822.0, ans=0.035 2023-06-24 20:09:02,180 INFO [train.py:996] (0/4) Epoch 7, batch 12700, loss[loss=0.1945, simple_loss=0.2689, pruned_loss=0.06008, over 21124.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3026, pruned_loss=0.07394, over 4271369.90 frames. ], batch size: 608, lr: 4.34e-03, grad_scale: 16.0 2023-06-24 20:09:06,250 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1174002.0, ans=0.0 2023-06-24 20:09:12,136 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1174002.0, ans=0.07 2023-06-24 20:09:45,244 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1174122.0, ans=0.2 2023-06-24 20:09:47,817 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.268e+02 2.796e+02 3.277e+02 3.938e+02 5.852e+02, threshold=6.553e+02, percent-clipped=0.0 2023-06-24 20:09:57,413 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.28 vs. 
limit=6.0 2023-06-24 20:10:50,792 INFO [train.py:996] (0/4) Epoch 7, batch 12750, loss[loss=0.2134, simple_loss=0.3031, pruned_loss=0.06182, over 21684.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.3038, pruned_loss=0.0742, over 4267972.58 frames. ], batch size: 389, lr: 4.34e-03, grad_scale: 16.0 2023-06-24 20:10:52,924 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=1174302.0, ans=10.0 2023-06-24 20:10:54,677 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 20:11:01,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1174302.0, ans=0.125 2023-06-24 20:12:01,708 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.76 vs. limit=15.0 2023-06-24 20:12:32,936 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1174542.0, ans=0.125 2023-06-24 20:12:39,037 INFO [train.py:996] (0/4) Epoch 7, batch 12800, loss[loss=0.2132, simple_loss=0.2959, pruned_loss=0.06527, over 21230.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.3035, pruned_loss=0.07471, over 4267043.14 frames. ], batch size: 176, lr: 4.34e-03, grad_scale: 32.0 2023-06-24 20:13:25,312 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.095e+02 2.978e+02 3.549e+02 4.677e+02 8.571e+02, threshold=7.098e+02, percent-clipped=5.0 2023-06-24 20:13:37,569 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.36 vs. limit=15.0 2023-06-24 20:13:56,564 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1174782.0, ans=0.125 2023-06-24 20:14:10,093 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1174842.0, ans=0.0 2023-06-24 20:14:18,225 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1174842.0, ans=0.0 2023-06-24 20:14:25,198 INFO [train.py:996] (0/4) Epoch 7, batch 12850, loss[loss=0.2017, simple_loss=0.3022, pruned_loss=0.05064, over 21746.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3061, pruned_loss=0.07655, over 4269233.42 frames. ], batch size: 351, lr: 4.34e-03, grad_scale: 32.0 2023-06-24 20:14:39,558 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1174902.0, ans=0.1 2023-06-24 20:16:06,298 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.60 vs. limit=22.5 2023-06-24 20:16:13,153 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1175142.0, ans=0.025 2023-06-24 20:16:16,221 INFO [train.py:996] (0/4) Epoch 7, batch 12900, loss[loss=0.2055, simple_loss=0.2766, pruned_loss=0.06724, over 21066.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.3034, pruned_loss=0.07347, over 4268652.10 frames. 
], batch size: 608, lr: 4.34e-03, grad_scale: 16.0 2023-06-24 20:17:00,989 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1175322.0, ans=0.125 2023-06-24 20:17:14,882 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.015e+02 2.556e+02 2.922e+02 3.625e+02 8.221e+02, threshold=5.845e+02, percent-clipped=4.0 2023-06-24 20:17:43,558 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1175442.0, ans=0.125 2023-06-24 20:18:05,584 INFO [train.py:996] (0/4) Epoch 7, batch 12950, loss[loss=0.2072, simple_loss=0.2868, pruned_loss=0.06376, over 21509.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.3018, pruned_loss=0.07198, over 4265102.06 frames. ], batch size: 131, lr: 4.33e-03, grad_scale: 16.0 2023-06-24 20:18:23,570 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1175502.0, ans=0.0 2023-06-24 20:18:25,660 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.26 vs. limit=22.5 2023-06-24 20:18:30,828 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1175562.0, ans=0.0 2023-06-24 20:19:10,624 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1175622.0, ans=0.125 2023-06-24 20:19:12,211 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1175682.0, ans=0.125 2023-06-24 20:19:15,001 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.85 vs. limit=12.0 2023-06-24 20:19:53,394 INFO [train.py:996] (0/4) Epoch 7, batch 13000, loss[loss=0.172, simple_loss=0.255, pruned_loss=0.04454, over 21770.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.3016, pruned_loss=0.07208, over 4274611.55 frames. ], batch size: 282, lr: 4.33e-03, grad_scale: 16.0 2023-06-24 20:19:58,415 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.98 vs. limit=15.0 2023-06-24 20:20:50,813 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.916e+02 2.748e+02 3.242e+02 4.275e+02 7.846e+02, threshold=6.485e+02, percent-clipped=8.0 2023-06-24 20:21:05,204 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-196000.pt 2023-06-24 20:21:21,423 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1176042.0, ans=0.5 2023-06-24 20:21:43,443 INFO [train.py:996] (0/4) Epoch 7, batch 13050, loss[loss=0.2432, simple_loss=0.3099, pruned_loss=0.08831, over 21769.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2987, pruned_loss=0.0694, over 4266331.81 frames. 
], batch size: 441, lr: 4.33e-03, grad_scale: 16.0 2023-06-24 20:23:00,002 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1176282.0, ans=0.125 2023-06-24 20:23:31,385 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1176402.0, ans=0.125 2023-06-24 20:23:32,696 INFO [train.py:996] (0/4) Epoch 7, batch 13100, loss[loss=0.2654, simple_loss=0.3431, pruned_loss=0.09386, over 21785.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.3008, pruned_loss=0.06928, over 4269478.37 frames. ], batch size: 124, lr: 4.33e-03, grad_scale: 16.0 2023-06-24 20:23:35,444 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.93 vs. limit=6.0 2023-06-24 20:23:45,891 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1176402.0, ans=0.125 2023-06-24 20:24:28,646 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1176522.0, ans=0.125 2023-06-24 20:24:31,410 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.070e+02 2.745e+02 3.057e+02 3.676e+02 6.184e+02, threshold=6.113e+02, percent-clipped=0.0 2023-06-24 20:25:33,949 INFO [train.py:996] (0/4) Epoch 7, batch 13150, loss[loss=0.1975, simple_loss=0.268, pruned_loss=0.06349, over 21184.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.3041, pruned_loss=0.0725, over 4272250.37 frames. ], batch size: 143, lr: 4.33e-03, grad_scale: 16.0 2023-06-24 20:25:59,368 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1176762.0, ans=0.0 2023-06-24 20:26:20,556 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1176822.0, ans=0.125 2023-06-24 20:27:10,873 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.14 vs. limit=12.0 2023-06-24 20:27:11,177 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.45 vs. limit=6.0 2023-06-24 20:27:28,471 INFO [train.py:996] (0/4) Epoch 7, batch 13200, loss[loss=0.2333, simple_loss=0.3028, pruned_loss=0.08191, over 21848.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.3011, pruned_loss=0.07234, over 4265950.34 frames. 
], batch size: 282, lr: 4.33e-03, grad_scale: 32.0 2023-06-24 20:27:45,527 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1177062.0, ans=0.125 2023-06-24 20:27:57,159 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1177062.0, ans=0.0 2023-06-24 20:28:17,663 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.491e+02 2.990e+02 3.679e+02 4.765e+02 8.248e+02, threshold=7.359e+02, percent-clipped=11.0 2023-06-24 20:28:18,696 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1177122.0, ans=0.0 2023-06-24 20:28:59,258 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1177242.0, ans=0.125 2023-06-24 20:29:11,465 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1177242.0, ans=0.2 2023-06-24 20:29:17,395 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1177302.0, ans=0.125 2023-06-24 20:29:18,292 INFO [train.py:996] (0/4) Epoch 7, batch 13250, loss[loss=0.2537, simple_loss=0.3355, pruned_loss=0.08592, over 20680.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.3017, pruned_loss=0.0741, over 4269558.35 frames. ], batch size: 607, lr: 4.33e-03, grad_scale: 16.0 2023-06-24 20:29:58,787 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1177422.0, ans=0.2 2023-06-24 20:30:02,887 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.20 vs. limit=22.5 2023-06-24 20:30:32,096 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.14 vs. limit=15.0 2023-06-24 20:31:09,732 INFO [train.py:996] (0/4) Epoch 7, batch 13300, loss[loss=0.2452, simple_loss=0.3273, pruned_loss=0.08155, over 21685.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.303, pruned_loss=0.0729, over 4269933.61 frames. 
], batch size: 298, lr: 4.33e-03, grad_scale: 16.0 2023-06-24 20:31:45,507 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1177662.0, ans=0.05 2023-06-24 20:31:47,157 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1177662.0, ans=0.125 2023-06-24 20:31:59,164 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1177722.0, ans=0.1 2023-06-24 20:32:10,685 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.218e+02 2.867e+02 3.500e+02 4.353e+02 7.353e+02, threshold=7.001e+02, percent-clipped=0.0 2023-06-24 20:32:32,406 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_na.min_abs, batch_count=1177782.0, ans=0.02 2023-06-24 20:32:38,414 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1177782.0, ans=0.0 2023-06-24 20:32:54,113 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1177842.0, ans=0.125 2023-06-24 20:32:55,705 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1177842.0, ans=0.0 2023-06-24 20:33:00,269 INFO [train.py:996] (0/4) Epoch 7, batch 13350, loss[loss=0.2756, simple_loss=0.3462, pruned_loss=0.1025, over 21745.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.308, pruned_loss=0.07544, over 4267858.89 frames. ], batch size: 124, lr: 4.33e-03, grad_scale: 16.0 2023-06-24 20:34:00,630 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1178022.0, ans=0.2 2023-06-24 20:34:48,871 INFO [train.py:996] (0/4) Epoch 7, batch 13400, loss[loss=0.2476, simple_loss=0.3266, pruned_loss=0.08427, over 21320.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3092, pruned_loss=0.07653, over 4271027.13 frames. ], batch size: 548, lr: 4.33e-03, grad_scale: 16.0 2023-06-24 20:35:27,797 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.01 vs. limit=10.0 2023-06-24 20:35:51,744 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1178322.0, ans=0.1 2023-06-24 20:35:54,570 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.196e+02 2.870e+02 3.236e+02 3.893e+02 7.079e+02, threshold=6.472e+02, percent-clipped=1.0 2023-06-24 20:36:12,547 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1178382.0, ans=0.125 2023-06-24 20:36:31,790 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1178442.0, ans=0.125 2023-06-24 20:36:43,548 INFO [train.py:996] (0/4) Epoch 7, batch 13450, loss[loss=0.2073, simple_loss=0.2711, pruned_loss=0.0717, over 21589.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3094, pruned_loss=0.07841, over 4274221.33 frames. 
], batch size: 230, lr: 4.33e-03, grad_scale: 16.0 2023-06-24 20:37:10,525 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1178562.0, ans=0.125 2023-06-24 20:38:05,992 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.07 vs. limit=15.0 2023-06-24 20:38:11,328 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.10 vs. limit=10.0 2023-06-24 20:38:16,670 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.93 vs. limit=15.0 2023-06-24 20:38:17,743 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1178742.0, ans=0.125 2023-06-24 20:38:33,354 INFO [train.py:996] (0/4) Epoch 7, batch 13500, loss[loss=0.218, simple_loss=0.2998, pruned_loss=0.06813, over 21922.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3005, pruned_loss=0.07617, over 4268889.23 frames. ], batch size: 317, lr: 4.33e-03, grad_scale: 16.0 2023-06-24 20:38:33,956 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1178802.0, ans=0.125 2023-06-24 20:38:40,751 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1178802.0, ans=0.125 2023-06-24 20:39:05,478 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1178862.0, ans=10.0 2023-06-24 20:39:14,791 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1178862.0, ans=0.0 2023-06-24 20:39:25,444 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1178922.0, ans=0.07 2023-06-24 20:39:35,911 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.421e+02 3.362e+02 3.847e+02 4.790e+02 7.815e+02, threshold=7.695e+02, percent-clipped=4.0 2023-06-24 20:39:55,742 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1178982.0, ans=0.0 2023-06-24 20:40:25,963 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1179042.0, ans=0.125 2023-06-24 20:40:30,482 INFO [train.py:996] (0/4) Epoch 7, batch 13550, loss[loss=0.2445, simple_loss=0.3371, pruned_loss=0.07591, over 20703.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3045, pruned_loss=0.07579, over 4268101.58 frames. 
], batch size: 607, lr: 4.33e-03, grad_scale: 16.0 2023-06-24 20:41:09,378 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1179162.0, ans=0.125 2023-06-24 20:41:18,016 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1179222.0, ans=0.125 2023-06-24 20:42:02,189 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1179342.0, ans=0.0 2023-06-24 20:42:19,514 INFO [train.py:996] (0/4) Epoch 7, batch 13600, loss[loss=0.2196, simple_loss=0.2896, pruned_loss=0.0748, over 21308.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3062, pruned_loss=0.07608, over 4275156.27 frames. ], batch size: 159, lr: 4.33e-03, grad_scale: 32.0 2023-06-24 20:43:13,910 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.974e+02 2.752e+02 3.319e+02 4.170e+02 8.424e+02, threshold=6.637e+02, percent-clipped=2.0 2023-06-24 20:43:21,955 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1179582.0, ans=0.0 2023-06-24 20:44:13,938 INFO [train.py:996] (0/4) Epoch 7, batch 13650, loss[loss=0.1961, simple_loss=0.2634, pruned_loss=0.06446, over 21637.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.3025, pruned_loss=0.07333, over 4272323.07 frames. ], batch size: 332, lr: 4.33e-03, grad_scale: 32.0 2023-06-24 20:44:18,195 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1179702.0, ans=0.125 2023-06-24 20:44:38,674 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1179762.0, ans=10.0 2023-06-24 20:44:55,663 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1179822.0, ans=0.2 2023-06-24 20:45:02,926 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1179822.0, ans=0.125 2023-06-24 20:45:14,247 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1179882.0, ans=0.0 2023-06-24 20:45:39,970 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.74 vs. limit=12.0 2023-06-24 20:46:02,789 INFO [train.py:996] (0/4) Epoch 7, batch 13700, loss[loss=0.2199, simple_loss=0.3009, pruned_loss=0.06946, over 21743.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2973, pruned_loss=0.07294, over 4274228.13 frames. ], batch size: 351, lr: 4.33e-03, grad_scale: 16.0 2023-06-24 20:46:10,786 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.85 vs. limit=12.0 2023-06-24 20:46:53,963 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.072e+02 2.933e+02 3.408e+02 4.386e+02 8.480e+02, threshold=6.816e+02, percent-clipped=3.0 2023-06-24 20:46:57,877 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1180122.0, ans=0.2 2023-06-24 20:47:09,347 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.31 vs. 
limit=15.0 2023-06-24 20:47:25,604 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1180182.0, ans=0.0 2023-06-24 20:47:45,517 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1180242.0, ans=0.1 2023-06-24 20:47:58,387 INFO [train.py:996] (0/4) Epoch 7, batch 13750, loss[loss=0.2638, simple_loss=0.3405, pruned_loss=0.09355, over 21399.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2955, pruned_loss=0.07241, over 4265531.08 frames. ], batch size: 471, lr: 4.33e-03, grad_scale: 16.0 2023-06-24 20:48:19,736 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1180362.0, ans=0.125 2023-06-24 20:48:36,448 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1180422.0, ans=0.0 2023-06-24 20:49:16,736 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1180482.0, ans=0.1 2023-06-24 20:49:51,685 INFO [train.py:996] (0/4) Epoch 7, batch 13800, loss[loss=0.3109, simple_loss=0.4064, pruned_loss=0.1076, over 21523.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.3028, pruned_loss=0.07205, over 4270325.98 frames. ], batch size: 471, lr: 4.33e-03, grad_scale: 16.0 2023-06-24 20:50:55,001 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.264e+02 2.819e+02 3.661e+02 5.277e+02 1.106e+03, threshold=7.321e+02, percent-clipped=8.0 2023-06-24 20:50:55,676 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1180722.0, ans=0.0 2023-06-24 20:51:42,293 INFO [train.py:996] (0/4) Epoch 7, batch 13850, loss[loss=0.1845, simple_loss=0.2514, pruned_loss=0.05875, over 20788.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.307, pruned_loss=0.07262, over 4262677.55 frames. ], batch size: 608, lr: 4.32e-03, grad_scale: 16.0 2023-06-24 20:52:16,590 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1180962.0, ans=0.125 2023-06-24 20:53:13,234 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.53 vs. limit=6.0 2023-06-24 20:53:19,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1181142.0, ans=0.125 2023-06-24 20:53:30,399 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1181142.0, ans=0.125 2023-06-24 20:53:33,225 INFO [train.py:996] (0/4) Epoch 7, batch 13900, loss[loss=0.2694, simple_loss=0.3176, pruned_loss=0.1106, over 21723.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3107, pruned_loss=0.07613, over 4264908.68 frames. 
], batch size: 508, lr: 4.32e-03, grad_scale: 16.0 2023-06-24 20:54:28,632 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1181322.0, ans=0.07 2023-06-24 20:54:34,864 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.286e+02 3.148e+02 3.792e+02 4.891e+02 9.530e+02, threshold=7.583e+02, percent-clipped=4.0 2023-06-24 20:54:49,018 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1181382.0, ans=0.125 2023-06-24 20:55:22,186 INFO [train.py:996] (0/4) Epoch 7, batch 13950, loss[loss=0.2326, simple_loss=0.3138, pruned_loss=0.07573, over 21797.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3112, pruned_loss=0.07744, over 4273621.91 frames. ], batch size: 351, lr: 4.32e-03, grad_scale: 16.0 2023-06-24 20:55:22,674 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 20:55:41,606 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1181502.0, ans=0.125 2023-06-24 20:55:50,380 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1181562.0, ans=0.0 2023-06-24 20:56:09,290 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1181562.0, ans=10.0 2023-06-24 20:56:45,702 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1181682.0, ans=0.125 2023-06-24 20:56:58,003 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1181742.0, ans=0.1 2023-06-24 20:57:09,146 INFO [train.py:996] (0/4) Epoch 7, batch 14000, loss[loss=0.2432, simple_loss=0.3326, pruned_loss=0.0769, over 21576.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3091, pruned_loss=0.07543, over 4274008.19 frames. ], batch size: 471, lr: 4.32e-03, grad_scale: 32.0 2023-06-24 20:58:03,240 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1181922.0, ans=0.0 2023-06-24 20:58:14,896 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.195e+02 2.939e+02 3.299e+02 3.866e+02 1.368e+03, threshold=6.598e+02, percent-clipped=4.0 2023-06-24 20:58:29,421 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1181982.0, ans=0.125 2023-06-24 20:58:56,591 INFO [train.py:996] (0/4) Epoch 7, batch 14050, loss[loss=0.1873, simple_loss=0.2584, pruned_loss=0.05815, over 15360.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.3041, pruned_loss=0.07209, over 4267086.99 frames. 
], batch size: 60, lr: 4.32e-03, grad_scale: 32.0 2023-06-24 20:59:25,314 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1182162.0, ans=0.2 2023-06-24 20:59:32,115 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1182162.0, ans=0.125 2023-06-24 20:59:49,330 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1182222.0, ans=0.125 2023-06-24 21:00:44,831 INFO [train.py:996] (0/4) Epoch 7, batch 14100, loss[loss=0.2427, simple_loss=0.3531, pruned_loss=0.06616, over 19773.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2978, pruned_loss=0.0714, over 4265591.37 frames. ], batch size: 702, lr: 4.32e-03, grad_scale: 32.0 2023-06-24 21:01:10,266 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1182402.0, ans=0.125 2023-06-24 21:01:50,964 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1182522.0, ans=0.125 2023-06-24 21:01:52,075 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.852e+02 2.658e+02 3.185e+02 3.657e+02 7.559e+02, threshold=6.369e+02, percent-clipped=1.0 2023-06-24 21:02:05,819 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1182582.0, ans=0.1 2023-06-24 21:02:12,837 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1182642.0, ans=0.125 2023-06-24 21:02:29,764 INFO [train.py:996] (0/4) Epoch 7, batch 14150, loss[loss=0.239, simple_loss=0.3297, pruned_loss=0.07412, over 21609.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.3001, pruned_loss=0.07219, over 4272812.07 frames. ], batch size: 389, lr: 4.32e-03, grad_scale: 16.0 2023-06-24 21:02:32,075 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1182702.0, ans=0.0 2023-06-24 21:03:39,325 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1182882.0, ans=0.2 2023-06-24 21:03:51,442 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1182882.0, ans=0.1 2023-06-24 21:03:53,046 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1182882.0, ans=0.125 2023-06-24 21:04:14,272 INFO [train.py:996] (0/4) Epoch 7, batch 14200, loss[loss=0.2267, simple_loss=0.2922, pruned_loss=0.08061, over 21803.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2984, pruned_loss=0.07059, over 4267371.88 frames. 
], batch size: 371, lr: 4.32e-03, grad_scale: 16.0 2023-06-24 21:04:36,775 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1183062.0, ans=0.2 2023-06-24 21:05:17,838 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.744e+02 2.638e+02 3.052e+02 3.885e+02 7.622e+02, threshold=6.105e+02, percent-clipped=2.0 2023-06-24 21:05:25,260 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 21:06:00,034 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1183242.0, ans=0.0 2023-06-24 21:06:03,298 INFO [train.py:996] (0/4) Epoch 7, batch 14250, loss[loss=0.2381, simple_loss=0.3001, pruned_loss=0.08804, over 21188.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2924, pruned_loss=0.07014, over 4262819.72 frames. ], batch size: 143, lr: 4.32e-03, grad_scale: 8.0 2023-06-24 21:06:08,306 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.56 vs. limit=22.5 2023-06-24 21:06:48,156 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1183422.0, ans=0.125 2023-06-24 21:07:10,793 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1183482.0, ans=0.2 2023-06-24 21:07:19,752 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1183482.0, ans=0.125 2023-06-24 21:07:32,650 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.19 vs. limit=10.0 2023-06-24 21:07:34,105 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1183542.0, ans=0.125 2023-06-24 21:07:52,515 INFO [train.py:996] (0/4) Epoch 7, batch 14300, loss[loss=0.1999, simple_loss=0.2739, pruned_loss=0.06293, over 21834.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2967, pruned_loss=0.07101, over 4256956.16 frames. ], batch size: 118, lr: 4.32e-03, grad_scale: 8.0 2023-06-24 21:08:23,976 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.28 vs. limit=15.0 2023-06-24 21:08:56,722 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.111e+02 2.824e+02 3.398e+02 4.914e+02 1.429e+03, threshold=6.796e+02, percent-clipped=17.0 2023-06-24 21:09:27,598 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.85 vs. limit=22.5 2023-06-24 21:09:40,489 INFO [train.py:996] (0/4) Epoch 7, batch 14350, loss[loss=0.1783, simple_loss=0.2512, pruned_loss=0.05275, over 21357.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.3014, pruned_loss=0.07182, over 4249074.17 frames. 
], batch size: 131, lr: 4.32e-03, grad_scale: 8.0 2023-06-24 21:11:00,887 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1184082.0, ans=0.0 2023-06-24 21:11:20,600 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1184142.0, ans=0.125 2023-06-24 21:11:25,801 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1184142.0, ans=0.125 2023-06-24 21:11:28,492 INFO [train.py:996] (0/4) Epoch 7, batch 14400, loss[loss=0.1921, simple_loss=0.2617, pruned_loss=0.06128, over 21471.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.2985, pruned_loss=0.07215, over 4260237.42 frames. ], batch size: 212, lr: 4.32e-03, grad_scale: 16.0 2023-06-24 21:11:33,069 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1184202.0, ans=0.0 2023-06-24 21:11:51,473 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1184262.0, ans=0.1 2023-06-24 21:12:09,570 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1184322.0, ans=0.2 2023-06-24 21:12:32,414 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.148e+02 2.835e+02 3.374e+02 4.163e+02 7.231e+02, threshold=6.749e+02, percent-clipped=2.0 2023-06-24 21:12:48,352 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1184382.0, ans=0.1 2023-06-24 21:13:11,614 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1184442.0, ans=0.125 2023-06-24 21:13:14,856 INFO [train.py:996] (0/4) Epoch 7, batch 14450, loss[loss=0.2318, simple_loss=0.2892, pruned_loss=0.08724, over 21335.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2931, pruned_loss=0.07221, over 4264069.80 frames. ], batch size: 144, lr: 4.32e-03, grad_scale: 16.0 2023-06-24 21:13:25,660 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1184502.0, ans=0.0 2023-06-24 21:13:26,440 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.61 vs. limit=15.0 2023-06-24 21:13:41,757 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1184562.0, ans=0.2 2023-06-24 21:15:00,163 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1184742.0, ans=0.125 2023-06-24 21:15:03,206 INFO [train.py:996] (0/4) Epoch 7, batch 14500, loss[loss=0.1951, simple_loss=0.2844, pruned_loss=0.05294, over 21607.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2892, pruned_loss=0.07146, over 4263554.03 frames. ], batch size: 263, lr: 4.32e-03, grad_scale: 16.0 2023-06-24 21:15:43,906 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.08 vs. limit=15.0 2023-06-24 21:15:58,603 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.97 vs. 
limit=22.5 2023-06-24 21:16:05,722 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 21:16:08,438 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.052e+02 2.747e+02 3.186e+02 4.190e+02 7.871e+02, threshold=6.373e+02, percent-clipped=3.0 2023-06-24 21:16:32,679 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=6.62 vs. limit=22.5 2023-06-24 21:16:47,000 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=5.91 vs. limit=22.5 2023-06-24 21:16:47,461 INFO [train.py:996] (0/4) Epoch 7, batch 14550, loss[loss=0.2393, simple_loss=0.3149, pruned_loss=0.08185, over 21673.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2941, pruned_loss=0.0728, over 4265016.45 frames. ], batch size: 351, lr: 4.32e-03, grad_scale: 16.0 2023-06-24 21:17:50,684 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=1185222.0, ans=15.0 2023-06-24 21:18:02,585 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1185282.0, ans=0.0 2023-06-24 21:18:28,270 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1185342.0, ans=0.125 2023-06-24 21:18:37,541 INFO [train.py:996] (0/4) Epoch 7, batch 14600, loss[loss=0.1807, simple_loss=0.2297, pruned_loss=0.06582, over 20844.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.302, pruned_loss=0.07677, over 4265298.81 frames. ], batch size: 608, lr: 4.32e-03, grad_scale: 16.0 2023-06-24 21:19:02,682 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.76 vs. limit=15.0 2023-06-24 21:19:26,877 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1185522.0, ans=0.125 2023-06-24 21:19:42,435 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.336e+02 3.108e+02 3.903e+02 5.552e+02 1.166e+03, threshold=7.806e+02, percent-clipped=17.0 2023-06-24 21:19:46,337 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1185582.0, ans=0.0 2023-06-24 21:20:20,970 INFO [train.py:996] (0/4) Epoch 7, batch 14650, loss[loss=0.2671, simple_loss=0.353, pruned_loss=0.09066, over 21237.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3054, pruned_loss=0.07673, over 4268572.05 frames. ], batch size: 548, lr: 4.32e-03, grad_scale: 16.0 2023-06-24 21:21:04,026 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.12 vs. 
limit=10.0 2023-06-24 21:21:10,360 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1185822.0, ans=0.125 2023-06-24 21:21:36,209 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1185882.0, ans=0.0 2023-06-24 21:21:43,561 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1185942.0, ans=0.0 2023-06-24 21:22:00,328 INFO [train.py:996] (0/4) Epoch 7, batch 14700, loss[loss=0.2999, simple_loss=0.3843, pruned_loss=0.1078, over 21541.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2991, pruned_loss=0.07095, over 4272931.08 frames. ], batch size: 508, lr: 4.32e-03, grad_scale: 16.0 2023-06-24 21:22:47,566 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1186062.0, ans=0.2 2023-06-24 21:23:06,193 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.798e+02 2.369e+02 2.874e+02 3.417e+02 6.463e+02, threshold=5.748e+02, percent-clipped=0.0 2023-06-24 21:23:25,603 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1186182.0, ans=0.0 2023-06-24 21:23:51,814 INFO [train.py:996] (0/4) Epoch 7, batch 14750, loss[loss=0.2935, simple_loss=0.3571, pruned_loss=0.1149, over 21749.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.3042, pruned_loss=0.07365, over 4267424.23 frames. ], batch size: 441, lr: 4.31e-03, grad_scale: 16.0 2023-06-24 21:24:47,105 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1186422.0, ans=0.0 2023-06-24 21:24:50,594 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1186422.0, ans=0.04949747468305833 2023-06-24 21:24:56,007 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1186482.0, ans=0.2 2023-06-24 21:25:22,302 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1186542.0, ans=0.125 2023-06-24 21:25:48,304 INFO [train.py:996] (0/4) Epoch 7, batch 14800, loss[loss=0.2609, simple_loss=0.3252, pruned_loss=0.0983, over 21585.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3161, pruned_loss=0.07868, over 4270418.60 frames. ], batch size: 414, lr: 4.31e-03, grad_scale: 32.0 2023-06-24 21:25:57,847 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1186602.0, ans=0.1 2023-06-24 21:26:17,907 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1186662.0, ans=0.1 2023-06-24 21:26:22,944 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1186662.0, ans=0.125 2023-06-24 21:26:44,092 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.322e+02 3.322e+02 4.309e+02 5.612e+02 1.041e+03, threshold=8.619e+02, percent-clipped=22.0 2023-06-24 21:26:48,585 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.44 vs. 
limit=22.5 2023-06-24 21:27:05,983 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1186782.0, ans=0.1 2023-06-24 21:27:44,155 INFO [train.py:996] (0/4) Epoch 7, batch 14850, loss[loss=0.1901, simple_loss=0.2584, pruned_loss=0.06084, over 21068.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3102, pruned_loss=0.07893, over 4273975.25 frames. ], batch size: 143, lr: 4.31e-03, grad_scale: 32.0 2023-06-24 21:28:03,495 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1186962.0, ans=0.125 2023-06-24 21:28:03,511 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 21:28:15,066 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.59 vs. limit=15.0 2023-06-24 21:29:06,009 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.50 vs. limit=15.0 2023-06-24 21:29:23,360 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1187142.0, ans=0.125 2023-06-24 21:29:30,030 INFO [train.py:996] (0/4) Epoch 7, batch 14900, loss[loss=0.187, simple_loss=0.2459, pruned_loss=0.06406, over 20726.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3124, pruned_loss=0.08017, over 4275741.62 frames. ], batch size: 607, lr: 4.31e-03, grad_scale: 16.0 2023-06-24 21:29:57,011 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1187262.0, ans=0.0 2023-06-24 21:29:58,977 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1187262.0, ans=0.125 2023-06-24 21:30:18,269 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1187322.0, ans=0.0 2023-06-24 21:30:36,821 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.129e+02 3.164e+02 3.884e+02 4.869e+02 8.267e+02, threshold=7.767e+02, percent-clipped=0.0 2023-06-24 21:31:14,019 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1187442.0, ans=0.07 2023-06-24 21:31:20,167 INFO [train.py:996] (0/4) Epoch 7, batch 14950, loss[loss=0.2482, simple_loss=0.3303, pruned_loss=0.08303, over 21403.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3123, pruned_loss=0.07863, over 4274343.22 frames. ], batch size: 471, lr: 4.31e-03, grad_scale: 16.0 2023-06-24 21:31:29,509 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1187502.0, ans=0.125 2023-06-24 21:31:57,874 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.16 vs. limit=15.0 2023-06-24 21:32:30,609 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.43 vs. 
limit=15.0 2023-06-24 21:32:48,917 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1187682.0, ans=0.025 2023-06-24 21:32:52,180 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1187742.0, ans=0.125 2023-06-24 21:32:52,695 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.28 vs. limit=15.0 2023-06-24 21:33:09,311 INFO [train.py:996] (0/4) Epoch 7, batch 15000, loss[loss=0.2235, simple_loss=0.301, pruned_loss=0.07305, over 21812.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3141, pruned_loss=0.07959, over 4280033.04 frames. ], batch size: 351, lr: 4.31e-03, grad_scale: 16.0 2023-06-24 21:33:09,313 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-24 21:33:26,467 INFO [train.py:1028] (0/4) Epoch 7, validation: loss=0.2547, simple_loss=0.3504, pruned_loss=0.07951, over 1796401.00 frames. 2023-06-24 21:33:26,469 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23616MB 2023-06-24 21:33:45,155 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1187802.0, ans=0.04949747468305833 2023-06-24 21:33:50,493 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1187862.0, ans=0.1 2023-06-24 21:34:23,936 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1187922.0, ans=0.125 2023-06-24 21:34:39,132 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.086e+02 2.767e+02 3.159e+02 3.696e+02 5.819e+02, threshold=6.318e+02, percent-clipped=0.0 2023-06-24 21:34:52,505 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1187982.0, ans=0.125 2023-06-24 21:35:05,123 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1188042.0, ans=0.125 2023-06-24 21:35:09,080 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1188042.0, ans=0.125 2023-06-24 21:35:16,583 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1188102.0, ans=0.125 2023-06-24 21:35:17,630 INFO [train.py:996] (0/4) Epoch 7, batch 15050, loss[loss=0.2353, simple_loss=0.3171, pruned_loss=0.07678, over 21664.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3128, pruned_loss=0.07973, over 4276022.62 frames. ], batch size: 263, lr: 4.31e-03, grad_scale: 16.0 2023-06-24 21:35:56,788 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1188162.0, ans=0.0 2023-06-24 21:35:59,097 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.05 vs. limit=10.0 2023-06-24 21:35:59,120 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.18 vs. 
limit=22.5 2023-06-24 21:36:30,598 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1188282.0, ans=0.125 2023-06-24 21:36:32,404 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 21:36:39,927 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.42 vs. limit=15.0 2023-06-24 21:36:55,635 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.02 vs. limit=15.0 2023-06-24 21:36:58,994 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1188342.0, ans=0.0 2023-06-24 21:37:00,561 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1188402.0, ans=0.1 2023-06-24 21:37:07,750 INFO [train.py:996] (0/4) Epoch 7, batch 15100, loss[loss=0.2642, simple_loss=0.3485, pruned_loss=0.0899, over 21565.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3155, pruned_loss=0.07948, over 4278468.41 frames. ], batch size: 414, lr: 4.31e-03, grad_scale: 16.0 2023-06-24 21:37:18,422 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1188402.0, ans=0.1 2023-06-24 21:37:42,938 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1188462.0, ans=0.125 2023-06-24 21:37:53,346 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1188462.0, ans=0.2 2023-06-24 21:37:53,487 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1188462.0, ans=0.0 2023-06-24 21:38:13,588 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.271e+02 2.973e+02 3.589e+02 4.717e+02 7.835e+02, threshold=7.177e+02, percent-clipped=5.0 2023-06-24 21:38:39,985 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1188642.0, ans=0.125 2023-06-24 21:38:42,716 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.16 vs. limit=22.5 2023-06-24 21:39:00,094 INFO [train.py:996] (0/4) Epoch 7, batch 15150, loss[loss=0.1947, simple_loss=0.259, pruned_loss=0.06522, over 21559.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3125, pruned_loss=0.07968, over 4273908.00 frames. ], batch size: 263, lr: 4.31e-03, grad_scale: 16.0 2023-06-24 21:39:32,490 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1188762.0, ans=0.0 2023-06-24 21:39:53,521 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1188822.0, ans=0.125 2023-06-24 21:40:02,314 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1188882.0, ans=0.125 2023-06-24 21:40:10,109 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.61 vs. 
limit=15.0 2023-06-24 21:40:25,489 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1188942.0, ans=0.0 2023-06-24 21:40:40,228 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1188942.0, ans=0.2 2023-06-24 21:40:49,643 INFO [train.py:996] (0/4) Epoch 7, batch 15200, loss[loss=0.1953, simple_loss=0.2838, pruned_loss=0.05344, over 21229.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3032, pruned_loss=0.07576, over 4273423.18 frames. ], batch size: 549, lr: 4.31e-03, grad_scale: 32.0 2023-06-24 21:41:51,474 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.032e+02 2.555e+02 2.882e+02 3.442e+02 5.882e+02, threshold=5.763e+02, percent-clipped=0.0 2023-06-24 21:41:57,491 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1189182.0, ans=0.125 2023-06-24 21:42:30,924 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1189242.0, ans=0.125 2023-06-24 21:42:49,801 INFO [train.py:996] (0/4) Epoch 7, batch 15250, loss[loss=0.262, simple_loss=0.3837, pruned_loss=0.07013, over 19718.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.2984, pruned_loss=0.07409, over 4254935.92 frames. ], batch size: 702, lr: 4.31e-03, grad_scale: 32.0 2023-06-24 21:44:37,273 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1189542.0, ans=0.1 2023-06-24 21:44:40,023 INFO [train.py:996] (0/4) Epoch 7, batch 15300, loss[loss=0.1787, simple_loss=0.2295, pruned_loss=0.06397, over 20724.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.2995, pruned_loss=0.07651, over 4257696.39 frames. ], batch size: 609, lr: 4.31e-03, grad_scale: 16.0 2023-06-24 21:44:47,040 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1189602.0, ans=0.1 2023-06-24 21:44:47,198 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1189602.0, ans=0.0 2023-06-24 21:44:51,180 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1189602.0, ans=0.125 2023-06-24 21:45:00,378 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.17 vs. 
limit=15.0 2023-06-24 21:45:06,746 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1189662.0, ans=0.0 2023-06-24 21:45:37,484 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.380e+02 3.236e+02 3.827e+02 4.813e+02 8.149e+02, threshold=7.653e+02, percent-clipped=14.0 2023-06-24 21:45:45,165 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1189782.0, ans=0.2 2023-06-24 21:45:46,838 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1189782.0, ans=0.125 2023-06-24 21:45:46,896 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1189782.0, ans=0.125 2023-06-24 21:46:10,011 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.12 vs. limit=10.0 2023-06-24 21:46:18,425 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1189842.0, ans=0.125 2023-06-24 21:46:27,875 INFO [train.py:996] (0/4) Epoch 7, batch 15350, loss[loss=0.188, simple_loss=0.3055, pruned_loss=0.03524, over 19863.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3041, pruned_loss=0.07873, over 4267027.25 frames. ], batch size: 703, lr: 4.31e-03, grad_scale: 16.0 2023-06-24 21:46:28,480 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1189902.0, ans=0.125 2023-06-24 21:46:46,162 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.01 vs. limit=12.0 2023-06-24 21:46:56,239 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.65 vs. limit=22.5 2023-06-24 21:48:14,124 INFO [train.py:996] (0/4) Epoch 7, batch 15400, loss[loss=0.2455, simple_loss=0.3121, pruned_loss=0.08942, over 21812.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.304, pruned_loss=0.07746, over 4267030.46 frames. ], batch size: 441, lr: 4.31e-03, grad_scale: 16.0 2023-06-24 21:48:50,362 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1190322.0, ans=0.0 2023-06-24 21:48:50,404 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1190322.0, ans=0.1 2023-06-24 21:48:55,741 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.92 vs. limit=15.0 2023-06-24 21:49:05,374 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.161e+02 2.624e+02 3.015e+02 3.662e+02 6.507e+02, threshold=6.030e+02, percent-clipped=0.0 2023-06-24 21:50:02,511 INFO [train.py:996] (0/4) Epoch 7, batch 15450, loss[loss=0.2062, simple_loss=0.2746, pruned_loss=0.0689, over 21152.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3021, pruned_loss=0.07704, over 4267518.33 frames. ], batch size: 608, lr: 4.31e-03, grad_scale: 16.0 2023-06-24 21:50:12,113 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.00 vs. 
limit=22.5 2023-06-24 21:50:30,813 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1190562.0, ans=0.0 2023-06-24 21:51:00,886 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1190682.0, ans=0.125 2023-06-24 21:51:31,858 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1190742.0, ans=0.2 2023-06-24 21:51:49,864 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1190742.0, ans=0.125 2023-06-24 21:51:51,329 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1190802.0, ans=0.0 2023-06-24 21:51:52,839 INFO [train.py:996] (0/4) Epoch 7, batch 15500, loss[loss=0.2448, simple_loss=0.3201, pruned_loss=0.08471, over 21693.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3049, pruned_loss=0.07663, over 4268765.58 frames. ], batch size: 351, lr: 4.31e-03, grad_scale: 16.0 2023-06-24 21:51:56,595 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1190802.0, ans=0.125 2023-06-24 21:52:03,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1190802.0, ans=0.0 2023-06-24 21:52:16,951 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1190862.0, ans=0.1 2023-06-24 21:52:34,804 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1190922.0, ans=0.0 2023-06-24 21:52:51,802 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.076e+02 2.878e+02 3.263e+02 4.056e+02 7.756e+02, threshold=6.526e+02, percent-clipped=2.0 2023-06-24 21:53:37,043 INFO [train.py:996] (0/4) Epoch 7, batch 15550, loss[loss=0.1774, simple_loss=0.247, pruned_loss=0.05396, over 21906.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.3043, pruned_loss=0.07401, over 4272496.87 frames. ], batch size: 98, lr: 4.31e-03, grad_scale: 16.0 2023-06-24 21:53:40,992 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1191102.0, ans=0.0 2023-06-24 21:53:58,740 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1191162.0, ans=0.0 2023-06-24 21:54:27,309 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1191222.0, ans=0.125 2023-06-24 21:55:05,436 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1191342.0, ans=0.125 2023-06-24 21:55:20,538 INFO [train.py:996] (0/4) Epoch 7, batch 15600, loss[loss=0.2086, simple_loss=0.286, pruned_loss=0.06563, over 21769.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.2979, pruned_loss=0.07257, over 4267015.29 frames. 
], batch size: 351, lr: 4.31e-03, grad_scale: 32.0 2023-06-24 21:55:21,223 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 21:55:25,099 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1191402.0, ans=0.125 2023-06-24 21:55:40,501 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1191462.0, ans=0.1 2023-06-24 21:55:47,360 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1191462.0, ans=0.0 2023-06-24 21:56:23,746 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.070e+02 2.726e+02 3.210e+02 4.134e+02 7.598e+02, threshold=6.420e+02, percent-clipped=3.0 2023-06-24 21:56:46,527 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1191582.0, ans=0.2 2023-06-24 21:56:51,690 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1191642.0, ans=0.125 2023-06-24 21:57:04,361 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1191642.0, ans=0.0 2023-06-24 21:57:09,406 INFO [train.py:996] (0/4) Epoch 7, batch 15650, loss[loss=0.2223, simple_loss=0.2764, pruned_loss=0.08409, over 21348.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2967, pruned_loss=0.07201, over 4266971.11 frames. ], batch size: 160, lr: 4.30e-03, grad_scale: 32.0 2023-06-24 21:57:18,983 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.50 vs. limit=22.5 2023-06-24 21:58:30,015 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1191882.0, ans=0.0 2023-06-24 21:58:34,746 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1191882.0, ans=0.125 2023-06-24 21:58:50,990 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.70 vs. limit=12.0 2023-06-24 21:58:57,024 INFO [train.py:996] (0/4) Epoch 7, batch 15700, loss[loss=0.178, simple_loss=0.2516, pruned_loss=0.05217, over 21508.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2932, pruned_loss=0.07111, over 4266217.39 frames. ], batch size: 230, lr: 4.30e-03, grad_scale: 32.0 2023-06-24 21:59:22,985 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1192062.0, ans=0.0 2023-06-24 21:59:50,900 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.86 vs. limit=22.5 2023-06-24 22:00:00,280 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.170e+02 2.614e+02 3.168e+02 3.646e+02 5.632e+02, threshold=6.336e+02, percent-clipped=0.0 2023-06-24 22:00:28,660 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1192242.0, ans=0.0 2023-06-24 22:00:43,473 INFO [train.py:996] (0/4) Epoch 7, batch 15750, loss[loss=0.2179, simple_loss=0.2812, pruned_loss=0.07733, over 21273.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2887, pruned_loss=0.07086, over 4261758.50 frames. 
], batch size: 471, lr: 4.30e-03, grad_scale: 16.0 2023-06-24 22:00:45,917 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1192302.0, ans=0.1 2023-06-24 22:01:04,662 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 22:01:30,643 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=10.82 vs. limit=15.0 2023-06-24 22:02:04,879 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1192482.0, ans=0.2 2023-06-24 22:02:06,989 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1192482.0, ans=0.05 2023-06-24 22:02:20,988 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1192542.0, ans=0.125 2023-06-24 22:02:32,354 INFO [train.py:996] (0/4) Epoch 7, batch 15800, loss[loss=0.196, simple_loss=0.2665, pruned_loss=0.06276, over 21603.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2845, pruned_loss=0.07027, over 4257044.79 frames. ], batch size: 263, lr: 4.30e-03, grad_scale: 16.0 2023-06-24 22:02:52,055 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1192662.0, ans=0.125 2023-06-24 22:03:03,272 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1192662.0, ans=0.125 2023-06-24 22:03:37,346 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.306e+02 2.697e+02 3.086e+02 3.699e+02 6.270e+02, threshold=6.172e+02, percent-clipped=0.0 2023-06-24 22:04:15,604 INFO [train.py:996] (0/4) Epoch 7, batch 15850, loss[loss=0.2477, simple_loss=0.3061, pruned_loss=0.09467, over 21197.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2869, pruned_loss=0.07269, over 4256031.47 frames. ], batch size: 143, lr: 4.30e-03, grad_scale: 16.0 2023-06-24 22:04:37,215 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1192962.0, ans=0.125 2023-06-24 22:05:38,161 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1193082.0, ans=0.0 2023-06-24 22:06:01,265 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1193142.0, ans=0.2 2023-06-24 22:06:01,321 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1193142.0, ans=0.1 2023-06-24 22:06:04,582 INFO [train.py:996] (0/4) Epoch 7, batch 15900, loss[loss=0.2041, simple_loss=0.2654, pruned_loss=0.07135, over 21436.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2852, pruned_loss=0.07277, over 4265166.12 frames. ], batch size: 389, lr: 4.30e-03, grad_scale: 16.0 2023-06-24 22:06:07,343 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.51 vs. 
limit=15.0 2023-06-24 22:06:13,860 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1193202.0, ans=0.125 2023-06-24 22:06:57,846 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1193322.0, ans=0.125 2023-06-24 22:07:09,485 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.367e+02 2.996e+02 3.520e+02 4.315e+02 6.246e+02, threshold=7.040e+02, percent-clipped=3.0 2023-06-24 22:07:42,850 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1193442.0, ans=0.125 2023-06-24 22:07:43,419 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.80 vs. limit=12.0 2023-06-24 22:07:50,964 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=1193442.0, ans=15.0 2023-06-24 22:07:53,076 INFO [train.py:996] (0/4) Epoch 7, batch 15950, loss[loss=0.1883, simple_loss=0.2878, pruned_loss=0.04446, over 21710.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2856, pruned_loss=0.07021, over 4266030.02 frames. ], batch size: 247, lr: 4.30e-03, grad_scale: 16.0 2023-06-24 22:07:53,582 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1193502.0, ans=0.1 2023-06-24 22:08:00,763 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1193502.0, ans=0.125 2023-06-24 22:08:10,995 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1193562.0, ans=0.1 2023-06-24 22:08:15,407 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.88 vs. limit=15.0 2023-06-24 22:08:16,591 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1193562.0, ans=0.0 2023-06-24 22:09:26,534 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.62 vs. limit=12.0 2023-06-24 22:09:43,001 INFO [train.py:996] (0/4) Epoch 7, batch 16000, loss[loss=0.1817, simple_loss=0.2819, pruned_loss=0.04073, over 20934.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2881, pruned_loss=0.06845, over 4257531.96 frames. ], batch size: 607, lr: 4.30e-03, grad_scale: 32.0 2023-06-24 22:10:00,285 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1193862.0, ans=0.0 2023-06-24 22:10:00,745 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.86 vs. limit=6.0 2023-06-24 22:10:55,226 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.046e+02 3.010e+02 3.950e+02 5.010e+02 9.750e+02, threshold=7.899e+02, percent-clipped=10.0 2023-06-24 22:10:56,156 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1193982.0, ans=0.2 2023-06-24 22:11:32,509 INFO [train.py:996] (0/4) Epoch 7, batch 16050, loss[loss=0.1878, simple_loss=0.2967, pruned_loss=0.03942, over 20844.00 frames. 
], tot_loss[loss=0.2132, simple_loss=0.2919, pruned_loss=0.06724, over 4259278.00 frames. ], batch size: 608, lr: 4.30e-03, grad_scale: 16.0 2023-06-24 22:11:34,499 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1194102.0, ans=0.125 2023-06-24 22:12:15,302 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.84 vs. limit=15.0 2023-06-24 22:13:00,233 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1194282.0, ans=0.09899494936611666 2023-06-24 22:13:15,738 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1194342.0, ans=0.0 2023-06-24 22:13:20,146 INFO [train.py:996] (0/4) Epoch 7, batch 16100, loss[loss=0.2398, simple_loss=0.3317, pruned_loss=0.07392, over 21685.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2998, pruned_loss=0.06964, over 4269079.22 frames. ], batch size: 230, lr: 4.30e-03, grad_scale: 16.0 2023-06-24 22:14:16,411 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1194582.0, ans=0.125 2023-06-24 22:14:25,474 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.126e+02 3.065e+02 3.753e+02 4.772e+02 1.110e+03, threshold=7.506e+02, percent-clipped=5.0 2023-06-24 22:14:29,353 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1194582.0, ans=0.125 2023-06-24 22:14:49,488 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.07 vs. limit=6.0 2023-06-24 22:15:05,695 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1194702.0, ans=0.1 2023-06-24 22:15:06,569 INFO [train.py:996] (0/4) Epoch 7, batch 16150, loss[loss=0.22, simple_loss=0.2778, pruned_loss=0.08113, over 21622.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2984, pruned_loss=0.07167, over 4280812.04 frames. ], batch size: 548, lr: 4.30e-03, grad_scale: 16.0 2023-06-24 22:15:16,247 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1194702.0, ans=0.0 2023-06-24 22:16:09,472 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1194882.0, ans=0.0 2023-06-24 22:16:31,335 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1194882.0, ans=0.125 2023-06-24 22:16:38,088 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1194942.0, ans=0.125 2023-06-24 22:16:52,376 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1194942.0, ans=0.125 2023-06-24 22:16:56,036 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1195002.0, ans=0.2 2023-06-24 22:16:57,004 INFO [train.py:996] (0/4) Epoch 7, batch 16200, loss[loss=0.2846, simple_loss=0.355, pruned_loss=0.1071, over 21847.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.3013, pruned_loss=0.07314, over 4286755.90 frames. 
], batch size: 124, lr: 4.30e-03, grad_scale: 16.0 2023-06-24 22:17:06,835 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1195002.0, ans=0.125 2023-06-24 22:17:19,934 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.07 vs. limit=15.0 2023-06-24 22:18:12,668 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1195182.0, ans=0.125 2023-06-24 22:18:15,479 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.476e+02 2.895e+02 3.394e+02 4.172e+02 8.958e+02, threshold=6.788e+02, percent-clipped=2.0 2023-06-24 22:18:16,463 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1195182.0, ans=0.2 2023-06-24 22:18:32,631 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1195242.0, ans=0.125 2023-06-24 22:18:42,925 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1195242.0, ans=0.125 2023-06-24 22:18:47,731 INFO [train.py:996] (0/4) Epoch 7, batch 16250, loss[loss=0.2362, simple_loss=0.3146, pruned_loss=0.07893, over 20734.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.2992, pruned_loss=0.07366, over 4272087.84 frames. ], batch size: 607, lr: 4.30e-03, grad_scale: 16.0 2023-06-24 22:19:12,447 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1195362.0, ans=0.125 2023-06-24 22:19:25,821 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1195362.0, ans=0.0 2023-06-24 22:19:37,826 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.46 vs. limit=15.0 2023-06-24 22:20:07,016 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1195482.0, ans=0.1 2023-06-24 22:20:14,098 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1195542.0, ans=0.125 2023-06-24 22:20:30,261 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=1195602.0, ans=10.0 2023-06-24 22:20:31,150 INFO [train.py:996] (0/4) Epoch 7, batch 16300, loss[loss=0.1761, simple_loss=0.2533, pruned_loss=0.04946, over 21740.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2944, pruned_loss=0.06991, over 4271742.61 frames. ], batch size: 118, lr: 4.30e-03, grad_scale: 16.0 2023-06-24 22:21:24,766 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1195722.0, ans=0.125 2023-06-24 22:21:48,948 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.184e+02 2.667e+02 3.225e+02 3.668e+02 6.965e+02, threshold=6.450e+02, percent-clipped=1.0 2023-06-24 22:22:20,762 INFO [train.py:996] (0/4) Epoch 7, batch 16350, loss[loss=0.2227, simple_loss=0.2984, pruned_loss=0.07348, over 21306.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2934, pruned_loss=0.07, over 4262365.45 frames. 
], batch size: 549, lr: 4.30e-03, grad_scale: 16.0 2023-06-24 22:22:21,490 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1195902.0, ans=0.0 2023-06-24 22:23:26,488 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.41 vs. limit=6.0 2023-06-24 22:24:04,373 INFO [train.py:996] (0/4) Epoch 7, batch 16400, loss[loss=0.2173, simple_loss=0.2965, pruned_loss=0.06905, over 21714.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2972, pruned_loss=0.07164, over 4262784.60 frames. ], batch size: 389, lr: 4.30e-03, grad_scale: 32.0 2023-06-24 22:24:53,898 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.40 vs. limit=22.5 2023-06-24 22:25:06,076 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.01 vs. limit=10.0 2023-06-24 22:25:16,392 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.060e+02 2.934e+02 3.396e+02 4.473e+02 6.388e+02, threshold=6.793e+02, percent-clipped=0.0 2023-06-24 22:25:17,153 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1196382.0, ans=0.1 2023-06-24 22:25:48,970 INFO [train.py:996] (0/4) Epoch 7, batch 16450, loss[loss=0.2076, simple_loss=0.2857, pruned_loss=0.06469, over 21899.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.2976, pruned_loss=0.07194, over 4264489.02 frames. ], batch size: 351, lr: 4.30e-03, grad_scale: 32.0 2023-06-24 22:25:56,498 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1196502.0, ans=0.125 2023-06-24 22:26:07,372 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1196562.0, ans=0.0 2023-06-24 22:26:23,894 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1196562.0, ans=0.1 2023-06-24 22:27:17,255 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1196742.0, ans=0.0 2023-06-24 22:27:32,602 INFO [train.py:996] (0/4) Epoch 7, batch 16500, loss[loss=0.2075, simple_loss=0.2824, pruned_loss=0.06633, over 21773.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.2968, pruned_loss=0.07267, over 4275333.04 frames. ], batch size: 298, lr: 4.30e-03, grad_scale: 32.0 2023-06-24 22:28:07,346 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1196862.0, ans=0.125 2023-06-24 22:28:37,166 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1196922.0, ans=0.125 2023-06-24 22:28:42,855 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.45 vs. 
limit=15.0 2023-06-24 22:28:50,000 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1196982.0, ans=0.125 2023-06-24 22:28:51,344 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.322e+02 3.249e+02 4.017e+02 5.671e+02 1.121e+03, threshold=8.034e+02, percent-clipped=17.0 2023-06-24 22:28:51,990 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1196982.0, ans=0.125 2023-06-24 22:28:57,162 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1196982.0, ans=0.0 2023-06-24 22:29:01,452 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1196982.0, ans=0.2 2023-06-24 22:29:03,124 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1196982.0, ans=0.0 2023-06-24 22:29:23,157 INFO [train.py:996] (0/4) Epoch 7, batch 16550, loss[loss=0.1909, simple_loss=0.2626, pruned_loss=0.05964, over 21696.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2952, pruned_loss=0.07092, over 4279092.46 frames. ], batch size: 263, lr: 4.30e-03, grad_scale: 32.0 2023-06-24 22:30:28,074 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1197222.0, ans=0.07 2023-06-24 22:30:36,067 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff3.min_abs, batch_count=1197222.0, ans=0.2 2023-06-24 22:31:36,590 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1197402.0, ans=0.125 2023-06-24 22:31:37,522 INFO [train.py:996] (0/4) Epoch 7, batch 16600, loss[loss=0.2581, simple_loss=0.3417, pruned_loss=0.08723, over 21256.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.3019, pruned_loss=0.07252, over 4269850.56 frames. ], batch size: 159, lr: 4.29e-03, grad_scale: 16.0 2023-06-24 22:31:43,472 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1197402.0, ans=0.0 2023-06-24 22:31:50,800 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1197402.0, ans=0.2 2023-06-24 22:32:32,040 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1197582.0, ans=0.125 2023-06-24 22:32:36,777 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.639e+02 3.261e+02 4.003e+02 5.335e+02 1.096e+03, threshold=8.006e+02, percent-clipped=4.0 2023-06-24 22:32:53,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1197642.0, ans=0.125 2023-06-24 22:33:17,078 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1197642.0, ans=0.125 2023-06-24 22:33:20,923 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 22:33:29,481 INFO [train.py:996] (0/4) Epoch 7, batch 16650, loss[loss=0.2359, simple_loss=0.3399, pruned_loss=0.0659, over 20761.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3137, pruned_loss=0.07656, over 4275512.20 frames. 
], batch size: 607, lr: 4.29e-03, grad_scale: 16.0 2023-06-24 22:33:43,291 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1197702.0, ans=0.125 2023-06-24 22:34:15,568 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.41 vs. limit=15.0 2023-06-24 22:35:17,508 INFO [train.py:996] (0/4) Epoch 7, batch 16700, loss[loss=0.1912, simple_loss=0.2534, pruned_loss=0.06445, over 21425.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3139, pruned_loss=0.07789, over 4274313.08 frames. ], batch size: 194, lr: 4.29e-03, grad_scale: 16.0 2023-06-24 22:35:38,494 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.22 vs. limit=15.0 2023-06-24 22:35:42,585 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.96 vs. limit=6.0 2023-06-24 22:36:09,387 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1198122.0, ans=0.125 2023-06-24 22:36:11,131 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1198122.0, ans=0.1 2023-06-24 22:36:38,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1198182.0, ans=0.1 2023-06-24 22:36:39,633 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.570e+02 3.449e+02 4.344e+02 5.804e+02 8.392e+02, threshold=8.689e+02, percent-clipped=2.0 2023-06-24 22:37:11,574 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1198302.0, ans=0.0 2023-06-24 22:37:12,526 INFO [train.py:996] (0/4) Epoch 7, batch 16750, loss[loss=0.2686, simple_loss=0.3589, pruned_loss=0.08912, over 21915.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3163, pruned_loss=0.07988, over 4273741.22 frames. ], batch size: 372, lr: 4.29e-03, grad_scale: 16.0 2023-06-24 22:38:04,067 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1198362.0, ans=0.125 2023-06-24 22:38:20,367 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1198422.0, ans=0.125 2023-06-24 22:38:35,531 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1198482.0, ans=0.09899494936611666 2023-06-24 22:39:02,819 INFO [train.py:996] (0/4) Epoch 7, batch 16800, loss[loss=0.2415, simple_loss=0.3749, pruned_loss=0.05406, over 20704.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3214, pruned_loss=0.0801, over 4268422.49 frames. ], batch size: 607, lr: 4.29e-03, grad_scale: 32.0 2023-06-24 22:40:13,347 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.86 vs. limit=15.0 2023-06-24 22:40:20,485 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.624e+02 3.537e+02 4.384e+02 6.125e+02 1.119e+03, threshold=8.769e+02, percent-clipped=3.0 2023-06-24 22:40:55,239 INFO [train.py:996] (0/4) Epoch 7, batch 16850, loss[loss=0.222, simple_loss=0.3155, pruned_loss=0.06426, over 17271.00 frames. 
], tot_loss[loss=0.2385, simple_loss=0.3176, pruned_loss=0.07969, over 4272697.98 frames. ], batch size: 60, lr: 4.29e-03, grad_scale: 32.0 2023-06-24 22:41:14,864 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1198902.0, ans=0.05 2023-06-24 22:41:38,642 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1198962.0, ans=0.2 2023-06-24 22:42:47,361 INFO [train.py:996] (0/4) Epoch 7, batch 16900, loss[loss=0.1988, simple_loss=0.2731, pruned_loss=0.06225, over 21624.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3121, pruned_loss=0.07875, over 4273812.20 frames. ], batch size: 391, lr: 4.29e-03, grad_scale: 32.0 2023-06-24 22:42:53,100 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 22:42:58,925 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.79 vs. limit=15.0 2023-06-24 22:43:04,994 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1199202.0, ans=0.0 2023-06-24 22:43:38,209 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.25 vs. limit=15.0 2023-06-24 22:43:54,625 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.184e+02 2.672e+02 3.013e+02 3.696e+02 7.423e+02, threshold=6.025e+02, percent-clipped=0.0 2023-06-24 22:44:00,274 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1199382.0, ans=0.0 2023-06-24 22:44:34,211 INFO [train.py:996] (0/4) Epoch 7, batch 16950, loss[loss=0.2152, simple_loss=0.2873, pruned_loss=0.07155, over 21863.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3041, pruned_loss=0.07637, over 4268138.48 frames. ], batch size: 332, lr: 4.29e-03, grad_scale: 32.0 2023-06-24 22:45:07,802 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1199562.0, ans=0.1 2023-06-24 22:45:31,878 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1199622.0, ans=0.09899494936611666 2023-06-24 22:45:40,634 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.05 vs. limit=22.5 2023-06-24 22:46:01,332 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1199742.0, ans=0.0 2023-06-24 22:46:21,579 INFO [train.py:996] (0/4) Epoch 7, batch 17000, loss[loss=0.2593, simple_loss=0.3193, pruned_loss=0.09963, over 21913.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.301, pruned_loss=0.07675, over 4275140.32 frames. ], batch size: 414, lr: 4.29e-03, grad_scale: 32.0 2023-06-24 22:46:41,037 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.96 vs. 
limit=15.0 2023-06-24 22:47:29,805 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-200000.pt 2023-06-24 22:47:33,177 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.260e+02 3.127e+02 3.708e+02 4.467e+02 7.774e+02, threshold=7.417e+02, percent-clipped=6.0 2023-06-24 22:48:06,689 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1200042.0, ans=0.125 2023-06-24 22:48:18,097 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.90 vs. limit=15.0 2023-06-24 22:48:18,551 INFO [train.py:996] (0/4) Epoch 7, batch 17050, loss[loss=0.2328, simple_loss=0.3122, pruned_loss=0.07671, over 21443.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3071, pruned_loss=0.07902, over 4282446.40 frames. ], batch size: 211, lr: 4.29e-03, grad_scale: 32.0 2023-06-24 22:49:21,444 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1200282.0, ans=0.1 2023-06-24 22:50:04,898 INFO [train.py:996] (0/4) Epoch 7, batch 17100, loss[loss=0.2104, simple_loss=0.2775, pruned_loss=0.0716, over 21651.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3069, pruned_loss=0.07915, over 4290585.65 frames. ], batch size: 263, lr: 4.29e-03, grad_scale: 32.0 2023-06-24 22:50:06,910 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1200402.0, ans=0.1 2023-06-24 22:50:48,851 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1200522.0, ans=0.125 2023-06-24 22:50:53,680 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1200522.0, ans=0.125 2023-06-24 22:51:07,027 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.328e+02 2.876e+02 3.458e+02 4.009e+02 6.895e+02, threshold=6.917e+02, percent-clipped=0.0 2023-06-24 22:51:46,950 INFO [train.py:996] (0/4) Epoch 7, batch 17150, loss[loss=0.2164, simple_loss=0.2964, pruned_loss=0.06815, over 21558.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3026, pruned_loss=0.07856, over 4286502.84 frames. ], batch size: 471, lr: 4.29e-03, grad_scale: 32.0 2023-06-24 22:52:01,764 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1200702.0, ans=0.125 2023-06-24 22:52:28,314 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1200822.0, ans=0.1 2023-06-24 22:52:35,131 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1200822.0, ans=0.2 2023-06-24 22:52:35,174 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1200822.0, ans=0.2 2023-06-24 22:53:42,015 INFO [train.py:996] (0/4) Epoch 7, batch 17200, loss[loss=0.253, simple_loss=0.3204, pruned_loss=0.09277, over 21542.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3029, pruned_loss=0.07829, over 4283672.41 frames. 
], batch size: 414, lr: 4.29e-03, grad_scale: 32.0 2023-06-24 22:53:48,207 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1201002.0, ans=0.2 2023-06-24 22:54:53,174 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.136e+02 2.812e+02 3.269e+02 4.158e+02 6.698e+02, threshold=6.538e+02, percent-clipped=0.0 2023-06-24 22:55:19,528 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1201242.0, ans=0.05 2023-06-24 22:55:33,465 INFO [train.py:996] (0/4) Epoch 7, batch 17250, loss[loss=0.2346, simple_loss=0.3138, pruned_loss=0.07774, over 21623.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3067, pruned_loss=0.08008, over 4285130.79 frames. ], batch size: 263, lr: 4.29e-03, grad_scale: 32.0 2023-06-24 22:57:19,497 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1201542.0, ans=0.125 2023-06-24 22:57:24,162 INFO [train.py:996] (0/4) Epoch 7, batch 17300, loss[loss=0.2623, simple_loss=0.325, pruned_loss=0.09978, over 21326.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3137, pruned_loss=0.08273, over 4281308.17 frames. ], batch size: 176, lr: 4.29e-03, grad_scale: 16.0 2023-06-24 22:58:34,574 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.18 vs. limit=22.5 2023-06-24 22:58:47,376 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.589e+02 3.154e+02 3.783e+02 4.784e+02 7.470e+02, threshold=7.566e+02, percent-clipped=5.0 2023-06-24 22:58:53,943 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1201782.0, ans=0.125 2023-06-24 22:59:05,334 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.88 vs. limit=10.0 2023-06-24 22:59:15,009 INFO [train.py:996] (0/4) Epoch 7, batch 17350, loss[loss=0.1653, simple_loss=0.2116, pruned_loss=0.05952, over 17253.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3152, pruned_loss=0.0826, over 4269052.70 frames. ], batch size: 63, lr: 4.29e-03, grad_scale: 16.0 2023-06-24 23:00:05,544 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.00 vs. limit=15.0 2023-06-24 23:00:31,620 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1202082.0, ans=0.0 2023-06-24 23:00:34,069 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.94 vs. limit=15.0 2023-06-24 23:00:51,764 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.03 vs. limit=12.0 2023-06-24 23:01:04,946 INFO [train.py:996] (0/4) Epoch 7, batch 17400, loss[loss=0.1491, simple_loss=0.1868, pruned_loss=0.05572, over 16789.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3097, pruned_loss=0.07835, over 4260321.09 frames. 
], batch size: 60, lr: 4.29e-03, grad_scale: 16.0 2023-06-24 23:01:05,607 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1202202.0, ans=0.125 2023-06-24 23:01:10,254 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.32 vs. limit=12.0 2023-06-24 23:02:25,935 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1202382.0, ans=0.125 2023-06-24 23:02:28,354 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.135e+02 3.058e+02 3.682e+02 4.915e+02 8.567e+02, threshold=7.364e+02, percent-clipped=2.0 2023-06-24 23:02:40,068 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1202442.0, ans=0.125 2023-06-24 23:03:05,441 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.37 vs. limit=15.0 2023-06-24 23:03:05,917 INFO [train.py:996] (0/4) Epoch 7, batch 17450, loss[loss=0.1809, simple_loss=0.2367, pruned_loss=0.06258, over 21823.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3071, pruned_loss=0.07614, over 4263543.05 frames. ], batch size: 98, lr: 4.29e-03, grad_scale: 16.0 2023-06-24 23:03:08,121 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1202502.0, ans=0.2 2023-06-24 23:03:25,555 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1202502.0, ans=0.125 2023-06-24 23:03:30,647 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1202562.0, ans=0.125 2023-06-24 23:03:57,505 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1202622.0, ans=0.125 2023-06-24 23:04:00,513 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=1202622.0, ans=0.025 2023-06-24 23:04:05,490 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1202622.0, ans=0.125 2023-06-24 23:04:26,492 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1202742.0, ans=0.0 2023-06-24 23:04:59,045 INFO [train.py:996] (0/4) Epoch 7, batch 17500, loss[loss=0.1937, simple_loss=0.2669, pruned_loss=0.0603, over 21029.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.3009, pruned_loss=0.07278, over 4268509.02 frames. ], batch size: 608, lr: 4.29e-03, grad_scale: 16.0 2023-06-24 23:05:09,783 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1202802.0, ans=0.125 2023-06-24 23:05:30,668 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.99 vs. 
limit=15.0 2023-06-24 23:05:36,671 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 23:05:43,429 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1202922.0, ans=0.0 2023-06-24 23:05:49,416 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.36 vs. limit=15.0 2023-06-24 23:06:04,515 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.111e+02 2.853e+02 3.403e+02 4.672e+02 8.323e+02, threshold=6.806e+02, percent-clipped=1.0 2023-06-24 23:06:44,180 INFO [train.py:996] (0/4) Epoch 7, batch 17550, loss[loss=0.2259, simple_loss=0.3159, pruned_loss=0.06794, over 21742.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.301, pruned_loss=0.07156, over 4273221.04 frames. ], batch size: 124, lr: 4.28e-03, grad_scale: 8.0 2023-06-24 23:07:00,186 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1203162.0, ans=0.125 2023-06-24 23:07:59,829 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1203342.0, ans=0.1 2023-06-24 23:08:32,345 INFO [train.py:996] (0/4) Epoch 7, batch 17600, loss[loss=0.2343, simple_loss=0.296, pruned_loss=0.08626, over 20112.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.3044, pruned_loss=0.07194, over 4262715.69 frames. ], batch size: 703, lr: 4.28e-03, grad_scale: 16.0 2023-06-24 23:08:33,012 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1203402.0, ans=0.025 2023-06-24 23:08:36,705 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1203402.0, ans=0.125 2023-06-24 23:08:43,658 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1203402.0, ans=0.125 2023-06-24 23:09:41,267 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.027e+02 2.778e+02 3.294e+02 4.134e+02 8.304e+02, threshold=6.589e+02, percent-clipped=2.0 2023-06-24 23:09:43,441 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1203582.0, ans=0.2 2023-06-24 23:10:19,647 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1203702.0, ans=0.2 2023-06-24 23:10:20,634 INFO [train.py:996] (0/4) Epoch 7, batch 17650, loss[loss=0.217, simple_loss=0.298, pruned_loss=0.06805, over 21681.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.3023, pruned_loss=0.07248, over 4266618.67 frames. ], batch size: 415, lr: 4.28e-03, grad_scale: 16.0 2023-06-24 23:10:22,738 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1203702.0, ans=0.1 2023-06-24 23:10:51,199 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.47 vs. 
limit=22.5 2023-06-24 23:11:11,834 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1203822.0, ans=0.125 2023-06-24 23:11:58,088 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1203942.0, ans=10.0 2023-06-24 23:12:12,063 INFO [train.py:996] (0/4) Epoch 7, batch 17700, loss[loss=0.2337, simple_loss=0.3138, pruned_loss=0.07676, over 21441.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2982, pruned_loss=0.07044, over 4272775.10 frames. ], batch size: 131, lr: 4.28e-03, grad_scale: 16.0 2023-06-24 23:12:27,109 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1204002.0, ans=0.0 2023-06-24 23:12:28,767 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1204002.0, ans=0.125 2023-06-24 23:13:05,823 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1204122.0, ans=0.2 2023-06-24 23:13:30,878 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.242e+02 2.965e+02 3.854e+02 5.323e+02 9.978e+02, threshold=7.709e+02, percent-clipped=16.0 2023-06-24 23:14:06,514 INFO [train.py:996] (0/4) Epoch 7, batch 17750, loss[loss=0.2477, simple_loss=0.3237, pruned_loss=0.08579, over 21335.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.306, pruned_loss=0.07413, over 4272036.48 frames. ], batch size: 176, lr: 4.28e-03, grad_scale: 16.0 2023-06-24 23:14:18,598 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.80 vs. limit=15.0 2023-06-24 23:14:35,285 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1204362.0, ans=0.0 2023-06-24 23:14:37,333 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1204362.0, ans=0.0 2023-06-24 23:15:56,639 INFO [train.py:996] (0/4) Epoch 7, batch 17800, loss[loss=0.2285, simple_loss=0.3095, pruned_loss=0.07379, over 20692.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.3061, pruned_loss=0.07345, over 4261833.20 frames. ], batch size: 609, lr: 4.28e-03, grad_scale: 16.0 2023-06-24 23:16:40,567 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1204722.0, ans=0.0 2023-06-24 23:17:23,013 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.078e+02 2.835e+02 3.431e+02 4.472e+02 1.183e+03, threshold=6.863e+02, percent-clipped=3.0 2023-06-24 23:17:47,982 INFO [train.py:996] (0/4) Epoch 7, batch 17850, loss[loss=0.1683, simple_loss=0.2276, pruned_loss=0.05452, over 21723.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3055, pruned_loss=0.07349, over 4262961.45 frames. ], batch size: 112, lr: 4.28e-03, grad_scale: 16.0 2023-06-24 23:18:26,867 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1204962.0, ans=0.0 2023-06-24 23:18:28,526 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1204962.0, ans=0.1 2023-06-24 23:18:29,293 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.88 vs. 
limit=15.0 2023-06-24 23:18:30,824 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1204962.0, ans=0.0 2023-06-24 23:19:08,477 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1205082.0, ans=0.125 2023-06-24 23:19:29,117 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1205142.0, ans=0.125 2023-06-24 23:19:31,036 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1205142.0, ans=0.125 2023-06-24 23:19:38,030 INFO [train.py:996] (0/4) Epoch 7, batch 17900, loss[loss=0.2395, simple_loss=0.3181, pruned_loss=0.08042, over 21285.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3106, pruned_loss=0.07632, over 4267862.82 frames. ], batch size: 159, lr: 4.28e-03, grad_scale: 16.0 2023-06-24 23:20:29,447 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=1205322.0, ans=0.5 2023-06-24 23:20:29,475 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1205322.0, ans=0.0 2023-06-24 23:20:41,227 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1205322.0, ans=0.125 2023-06-24 23:21:03,390 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.271e+02 2.987e+02 3.415e+02 4.264e+02 7.391e+02, threshold=6.831e+02, percent-clipped=3.0 2023-06-24 23:21:25,942 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.46 vs. limit=22.5 2023-06-24 23:21:27,890 INFO [train.py:996] (0/4) Epoch 7, batch 17950, loss[loss=0.2123, simple_loss=0.3039, pruned_loss=0.06031, over 21639.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3125, pruned_loss=0.07383, over 4270977.50 frames. ], batch size: 389, lr: 4.28e-03, grad_scale: 16.0 2023-06-24 23:21:53,793 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1205562.0, ans=0.0 2023-06-24 23:22:16,102 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.10 vs. limit=15.0 2023-06-24 23:22:23,255 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.08 vs. limit=15.0 2023-06-24 23:23:10,703 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1205742.0, ans=0.2 2023-06-24 23:23:11,184 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=1205742.0, ans=15.0 2023-06-24 23:23:19,952 INFO [train.py:996] (0/4) Epoch 7, batch 18000, loss[loss=0.1924, simple_loss=0.2576, pruned_loss=0.06359, over 21608.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3047, pruned_loss=0.07206, over 4268393.62 frames. ], batch size: 247, lr: 4.28e-03, grad_scale: 32.0 2023-06-24 23:23:19,953 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-24 23:23:40,283 INFO [train.py:1028] (0/4) Epoch 7, validation: loss=0.2616, simple_loss=0.3599, pruned_loss=0.08162, over 1796401.00 frames. 
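The [optim.py:471] lines in this log report gradient-norm quartiles together with a clipping threshold and a percent-clipped figure, and in the entries quoted here the threshold matches Clipping_scale times the median quartile (for example 2.0 x 4.017e+02 = 8.034e+02, and 2.0 x 3.708e+02 = 7.417e+02). The sketch below reproduces only that reporting arithmetic under that assumption; summarize_grad_norms and its arguments are hypothetical names for illustration and are not the actual optim.py implementation.

import torch

def summarize_grad_norms(grad_norms, clipping_scale=2.0):
    # Illustrative sketch only (not the real icefall optim.py): given a buffer of
    # recent per-batch gradient norms, report min/25%/median/75%/max quartiles,
    # a clipping threshold equal to clipping_scale * median, and the share of
    # batches whose norm exceeds that threshold.
    grad_norms = grad_norms.float()
    probs = torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0])
    quartiles = torch.quantile(grad_norms, probs)
    threshold = clipping_scale * float(quartiles[2])   # e.g. 2.0 * 4.017e+02 = 8.034e+02
    percent_clipped = 100.0 * float((grad_norms > threshold).float().mean())
    print("Clipping_scale=%.1f, grad-norm quartiles %s, threshold=%.3e, percent-clipped=%.1f"
          % (clipping_scale,
             " ".join("%.3e" % float(q) for q in quartiles),
             threshold, percent_clipped))
    return threshold

# e.g. summarize_grad_norms(torch.tensor([310.0, 420.0, 395.0, 515.0, 1090.0]))

The far more frequent [scaling.py:182] lines record the current value (ans) of a ScheduledFloat parameter, such as a dropout probability or a skip rate, at a given batch_count. A minimal sketch of one plausible such schedule follows, assuming piecewise-linear interpolation between (batch_count, value) breakpoints; the function name and the example breakpoints are made up, and the real ScheduledFloat in scaling.py may compute its value differently.

def scheduled_float(batch_count, schedule):
    # Illustrative only: `schedule` is a sorted list of (batch_count, value)
    # breakpoints; the returned value is linearly interpolated between them and
    # clamped at the ends.
    (x0, y0) = schedule[0]
    if batch_count <= x0:
        return y0
    for (x1, y1) in schedule[1:]:
        if batch_count <= x1:
            frac = (batch_count - x0) / (x1 - x0)
            return y0 + frac * (y1 - y0)
        (x0, y0) = (x1, y1)
    return y0

# e.g. scheduled_float(1205742.0, [(0.0, 0.3), (20000.0, 0.1)]) returns 0.1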
2023-06-24 23:23:40,284 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23616MB 2023-06-24 23:23:41,865 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.41 vs. limit=15.0 2023-06-24 23:24:11,657 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1205862.0, ans=0.125 2023-06-24 23:24:52,040 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1205982.0, ans=0.125 2023-06-24 23:24:55,039 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.910e+02 2.947e+02 3.493e+02 4.464e+02 9.866e+02, threshold=6.986e+02, percent-clipped=5.0 2023-06-24 23:25:35,633 INFO [train.py:996] (0/4) Epoch 7, batch 18050, loss[loss=0.2106, simple_loss=0.283, pruned_loss=0.06909, over 21666.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2972, pruned_loss=0.07072, over 4271261.27 frames. ], batch size: 298, lr: 4.28e-03, grad_scale: 32.0 2023-06-24 23:25:36,194 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1206102.0, ans=0.2 2023-06-24 23:26:15,920 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 23:27:02,199 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1206342.0, ans=10.0 2023-06-24 23:27:03,925 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1206342.0, ans=0.125 2023-06-24 23:27:32,136 INFO [train.py:996] (0/4) Epoch 7, batch 18100, loss[loss=0.2474, simple_loss=0.3165, pruned_loss=0.0891, over 21267.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.302, pruned_loss=0.07338, over 4277660.57 frames. ], batch size: 176, lr: 4.28e-03, grad_scale: 32.0 2023-06-24 23:27:35,803 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1206402.0, ans=0.125 2023-06-24 23:28:23,913 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.79 vs. limit=22.5 2023-06-24 23:28:44,816 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.264e+02 2.881e+02 3.345e+02 4.009e+02 7.924e+02, threshold=6.690e+02, percent-clipped=2.0 2023-06-24 23:28:55,465 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1206642.0, ans=0.0 2023-06-24 23:29:09,681 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1206642.0, ans=0.1 2023-06-24 23:29:14,113 INFO [train.py:996] (0/4) Epoch 7, batch 18150, loss[loss=0.2157, simple_loss=0.2928, pruned_loss=0.06935, over 21781.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.3025, pruned_loss=0.07235, over 4282845.98 frames. 
], batch size: 317, lr: 4.28e-03, grad_scale: 32.0 2023-06-24 23:29:31,837 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1206702.0, ans=0.2 2023-06-24 23:29:55,927 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1206822.0, ans=0.125 2023-06-24 23:30:07,926 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=1206822.0, ans=10.0 2023-06-24 23:30:36,151 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1206942.0, ans=0.125 2023-06-24 23:30:39,630 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1206942.0, ans=0.125 2023-06-24 23:30:59,297 INFO [train.py:996] (0/4) Epoch 7, batch 18200, loss[loss=0.2166, simple_loss=0.2802, pruned_loss=0.07647, over 21596.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.297, pruned_loss=0.07243, over 4288728.07 frames. ], batch size: 415, lr: 4.28e-03, grad_scale: 32.0 2023-06-24 23:31:33,638 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1207062.0, ans=0.0 2023-06-24 23:31:53,376 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1207182.0, ans=0.0 2023-06-24 23:32:05,006 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.956e+02 2.874e+02 3.635e+02 5.188e+02 1.150e+03, threshold=7.270e+02, percent-clipped=9.0 2023-06-24 23:32:38,591 INFO [train.py:996] (0/4) Epoch 7, batch 18250, loss[loss=0.19, simple_loss=0.2658, pruned_loss=0.05713, over 21465.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2886, pruned_loss=0.06909, over 4271339.25 frames. ], batch size: 131, lr: 4.28e-03, grad_scale: 32.0 2023-06-24 23:32:56,304 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1207302.0, ans=0.1 2023-06-24 23:33:46,392 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.46 vs. limit=15.0 2023-06-24 23:33:58,097 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1207542.0, ans=0.2 2023-06-24 23:34:24,256 INFO [train.py:996] (0/4) Epoch 7, batch 18300, loss[loss=0.2471, simple_loss=0.3554, pruned_loss=0.06935, over 21833.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2894, pruned_loss=0.07001, over 4271933.94 frames. ], batch size: 371, lr: 4.28e-03, grad_scale: 16.0 2023-06-24 23:34:34,647 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.23 vs. limit=12.0 2023-06-24 23:34:48,553 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.21 vs. limit=10.0 2023-06-24 23:34:57,052 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1207662.0, ans=0.2 2023-06-24 23:35:39,174 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.52 vs. 
limit=6.0 2023-06-24 23:35:39,691 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.046e+02 2.910e+02 3.541e+02 4.206e+02 1.059e+03, threshold=7.082e+02, percent-clipped=3.0 2023-06-24 23:36:12,357 INFO [train.py:996] (0/4) Epoch 7, batch 18350, loss[loss=0.1485, simple_loss=0.2259, pruned_loss=0.03552, over 16909.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2955, pruned_loss=0.06952, over 4261306.06 frames. ], batch size: 63, lr: 4.28e-03, grad_scale: 16.0 2023-06-24 23:36:39,711 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1207962.0, ans=0.125 2023-06-24 23:36:42,967 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1207962.0, ans=0.2 2023-06-24 23:37:25,666 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1208082.0, ans=0.0 2023-06-24 23:37:52,925 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1208142.0, ans=0.125 2023-06-24 23:38:01,050 INFO [train.py:996] (0/4) Epoch 7, batch 18400, loss[loss=0.1966, simple_loss=0.28, pruned_loss=0.05658, over 21618.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2914, pruned_loss=0.06818, over 4256600.54 frames. ], batch size: 414, lr: 4.28e-03, grad_scale: 32.0 2023-06-24 23:38:25,151 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1208262.0, ans=0.0 2023-06-24 23:38:43,424 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1208262.0, ans=0.0 2023-06-24 23:38:59,965 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1208322.0, ans=0.0 2023-06-24 23:39:08,837 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1208382.0, ans=0.0 2023-06-24 23:39:08,885 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1208382.0, ans=0.1 2023-06-24 23:39:16,999 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.081e+02 2.559e+02 3.009e+02 3.655e+02 5.951e+02, threshold=6.019e+02, percent-clipped=0.0 2023-06-24 23:39:29,988 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1208442.0, ans=0.125 2023-06-24 23:39:46,180 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1208442.0, ans=0.1 2023-06-24 23:39:49,320 INFO [train.py:996] (0/4) Epoch 7, batch 18450, loss[loss=0.1883, simple_loss=0.2849, pruned_loss=0.04588, over 21575.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2882, pruned_loss=0.06512, over 4254590.30 frames. 
], batch size: 442, lr: 4.27e-03, grad_scale: 32.0 2023-06-24 23:39:51,479 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1208502.0, ans=0.025 2023-06-24 23:40:45,806 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1208622.0, ans=0.1 2023-06-24 23:41:01,790 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1208682.0, ans=0.025 2023-06-24 23:41:03,449 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1208682.0, ans=0.0 2023-06-24 23:41:29,524 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1208742.0, ans=0.1 2023-06-24 23:41:38,250 INFO [train.py:996] (0/4) Epoch 7, batch 18500, loss[loss=0.1831, simple_loss=0.2511, pruned_loss=0.05757, over 21328.00 frames. ], tot_loss[loss=0.2071, simple_loss=0.2853, pruned_loss=0.06442, over 4250997.31 frames. ], batch size: 211, lr: 4.27e-03, grad_scale: 32.0 2023-06-24 23:42:36,479 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1208922.0, ans=0.0 2023-06-24 23:42:45,384 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1208982.0, ans=0.125 2023-06-24 23:42:59,436 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.887e+02 2.868e+02 3.588e+02 5.410e+02 1.340e+03, threshold=7.175e+02, percent-clipped=18.0 2023-06-24 23:43:24,452 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1209102.0, ans=0.07 2023-06-24 23:43:25,450 INFO [train.py:996] (0/4) Epoch 7, batch 18550, loss[loss=0.1816, simple_loss=0.2529, pruned_loss=0.05513, over 21373.00 frames. ], tot_loss[loss=0.2052, simple_loss=0.2839, pruned_loss=0.06324, over 4248648.00 frames. ], batch size: 194, lr: 4.27e-03, grad_scale: 16.0 2023-06-24 23:43:36,959 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1209102.0, ans=0.2 2023-06-24 23:43:42,569 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.42 vs. limit=15.0 2023-06-24 23:43:46,006 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.95 vs. limit=15.0 2023-06-24 23:43:58,325 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1209162.0, ans=0.0 2023-06-24 23:44:24,551 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1209222.0, ans=0.1 2023-06-24 23:44:38,554 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1209282.0, ans=0.125 2023-06-24 23:45:12,453 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1209402.0, ans=0.125 2023-06-24 23:45:13,358 INFO [train.py:996] (0/4) Epoch 7, batch 18600, loss[loss=0.2274, simple_loss=0.3127, pruned_loss=0.07107, over 21789.00 frames. 
], tot_loss[loss=0.2051, simple_loss=0.2819, pruned_loss=0.06419, over 4235837.86 frames. ], batch size: 371, lr: 4.27e-03, grad_scale: 16.0 2023-06-24 23:45:55,429 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.28 vs. limit=15.0 2023-06-24 23:46:08,700 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1209522.0, ans=0.0 2023-06-24 23:46:24,059 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1209582.0, ans=0.125 2023-06-24 23:46:35,141 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.056e+02 2.703e+02 3.435e+02 4.233e+02 7.811e+02, threshold=6.869e+02, percent-clipped=3.0 2023-06-24 23:46:42,079 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.03 vs. limit=15.0 2023-06-24 23:46:53,022 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1209642.0, ans=0.2 2023-06-24 23:47:01,131 INFO [train.py:996] (0/4) Epoch 7, batch 18650, loss[loss=0.1999, simple_loss=0.2695, pruned_loss=0.06515, over 15106.00 frames. ], tot_loss[loss=0.2058, simple_loss=0.2817, pruned_loss=0.06498, over 4224399.45 frames. ], batch size: 60, lr: 4.27e-03, grad_scale: 16.0 2023-06-24 23:47:36,993 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1209762.0, ans=0.125 2023-06-24 23:48:15,237 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1209882.0, ans=0.1 2023-06-24 23:48:17,050 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1209882.0, ans=0.0 2023-06-24 23:48:48,674 INFO [train.py:996] (0/4) Epoch 7, batch 18700, loss[loss=0.1978, simple_loss=0.2656, pruned_loss=0.06503, over 21812.00 frames. ], tot_loss[loss=0.2056, simple_loss=0.279, pruned_loss=0.06611, over 4240049.94 frames. ], batch size: 316, lr: 4.27e-03, grad_scale: 16.0 2023-06-24 23:49:52,685 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.45 vs. limit=6.0 2023-06-24 23:50:10,830 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.160e+02 2.803e+02 3.350e+02 3.905e+02 5.845e+02, threshold=6.700e+02, percent-clipped=0.0 2023-06-24 23:50:32,136 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1210242.0, ans=0.07 2023-06-24 23:50:36,537 INFO [train.py:996] (0/4) Epoch 7, batch 18750, loss[loss=0.2132, simple_loss=0.2798, pruned_loss=0.07329, over 21300.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2814, pruned_loss=0.0683, over 4254080.27 frames. 
], batch size: 176, lr: 4.27e-03, grad_scale: 16.0 2023-06-24 23:50:38,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1210302.0, ans=0.1 2023-06-24 23:51:01,035 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1210362.0, ans=0.1 2023-06-24 23:51:36,886 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.39 vs. limit=15.0 2023-06-24 23:52:03,385 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1210542.0, ans=0.125 2023-06-24 23:52:22,877 INFO [train.py:996] (0/4) Epoch 7, batch 18800, loss[loss=0.1966, simple_loss=0.2881, pruned_loss=0.05251, over 21786.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2865, pruned_loss=0.06972, over 4252734.93 frames. ], batch size: 282, lr: 4.27e-03, grad_scale: 32.0 2023-06-24 23:52:37,408 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1210602.0, ans=0.125 2023-06-24 23:52:49,539 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1210662.0, ans=0.125 2023-06-24 23:53:10,968 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.92 vs. limit=15.0 2023-06-24 23:53:43,559 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.736e+02 2.643e+02 3.373e+02 4.457e+02 8.790e+02, threshold=6.746e+02, percent-clipped=4.0 2023-06-24 23:54:09,238 INFO [train.py:996] (0/4) Epoch 7, batch 18850, loss[loss=0.1726, simple_loss=0.2596, pruned_loss=0.04284, over 21623.00 frames. ], tot_loss[loss=0.2062, simple_loss=0.2817, pruned_loss=0.06533, over 4249239.49 frames. ], batch size: 263, lr: 4.27e-03, grad_scale: 32.0 2023-06-24 23:54:15,339 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.22 vs. limit=15.0 2023-06-24 23:54:40,327 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1210962.0, ans=0.1 2023-06-24 23:55:00,920 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.85 vs. limit=12.0 2023-06-24 23:55:46,133 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1211142.0, ans=0.2 2023-06-24 23:55:56,268 INFO [train.py:996] (0/4) Epoch 7, batch 18900, loss[loss=0.1945, simple_loss=0.2638, pruned_loss=0.06256, over 21611.00 frames. ], tot_loss[loss=0.2049, simple_loss=0.2793, pruned_loss=0.06524, over 4256029.39 frames. 
], batch size: 298, lr: 4.27e-03, grad_scale: 32.0 2023-06-24 23:55:56,861 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1211202.0, ans=0.0 2023-06-24 23:56:08,460 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1211202.0, ans=0.0 2023-06-24 23:56:17,386 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1211262.0, ans=0.125 2023-06-24 23:56:28,643 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.81 vs. limit=12.0 2023-06-24 23:56:30,399 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.51 vs. limit=22.5 2023-06-24 23:56:41,754 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1211322.0, ans=0.125 2023-06-24 23:56:54,039 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1211322.0, ans=0.05 2023-06-24 23:57:05,903 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1211382.0, ans=0.125 2023-06-24 23:57:17,723 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.856e+02 2.759e+02 3.206e+02 4.379e+02 8.069e+02, threshold=6.411e+02, percent-clipped=2.0 2023-06-24 23:57:39,996 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.04 vs. limit=15.0 2023-06-24 23:57:42,993 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1211502.0, ans=0.125 2023-06-24 23:57:44,054 INFO [train.py:996] (0/4) Epoch 7, batch 18950, loss[loss=0.209, simple_loss=0.2837, pruned_loss=0.06714, over 21545.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2801, pruned_loss=0.06753, over 4269882.41 frames. ], batch size: 131, lr: 4.27e-03, grad_scale: 16.0 2023-06-24 23:57:48,981 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.52 vs. limit=10.0 2023-06-24 23:57:56,633 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=1211502.0, ans=15.0 2023-06-24 23:59:20,314 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1211742.0, ans=0.125 2023-06-24 23:59:39,046 INFO [train.py:996] (0/4) Epoch 7, batch 19000, loss[loss=0.2242, simple_loss=0.2755, pruned_loss=0.08646, over 21835.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2903, pruned_loss=0.06998, over 4270929.33 frames. 
], batch size: 98, lr: 4.27e-03, grad_scale: 16.0 2023-06-24 23:59:39,747 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1211802.0, ans=0.125 2023-06-25 00:00:33,780 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1211922.0, ans=0.0 2023-06-25 00:00:54,418 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1211982.0, ans=0.125 2023-06-25 00:01:02,158 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.437e+02 3.130e+02 3.898e+02 4.619e+02 8.945e+02, threshold=7.797e+02, percent-clipped=5.0 2023-06-25 00:01:26,862 INFO [train.py:996] (0/4) Epoch 7, batch 19050, loss[loss=0.2338, simple_loss=0.2999, pruned_loss=0.08388, over 21312.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2947, pruned_loss=0.07302, over 4276596.17 frames. ], batch size: 143, lr: 4.27e-03, grad_scale: 16.0 2023-06-25 00:01:51,370 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1212162.0, ans=0.125 2023-06-25 00:02:41,057 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.55 vs. limit=15.0 2023-06-25 00:02:49,802 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.35 vs. limit=15.0 2023-06-25 00:02:54,141 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1212342.0, ans=0.0 2023-06-25 00:03:13,227 INFO [train.py:996] (0/4) Epoch 7, batch 19100, loss[loss=0.2309, simple_loss=0.2895, pruned_loss=0.08614, over 21259.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2927, pruned_loss=0.07331, over 4280309.72 frames. ], batch size: 471, lr: 4.27e-03, grad_scale: 16.0 2023-06-25 00:03:48,440 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1212462.0, ans=0.125 2023-06-25 00:03:53,593 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1212522.0, ans=0.2 2023-06-25 00:04:38,992 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.215e+02 2.814e+02 3.416e+02 4.391e+02 9.529e+02, threshold=6.832e+02, percent-clipped=4.0 2023-06-25 00:05:04,655 INFO [train.py:996] (0/4) Epoch 7, batch 19150, loss[loss=0.228, simple_loss=0.324, pruned_loss=0.06599, over 21495.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.2957, pruned_loss=0.07438, over 4276720.76 frames. ], batch size: 230, lr: 4.27e-03, grad_scale: 16.0 2023-06-25 00:05:19,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1212702.0, ans=0.0 2023-06-25 00:07:00,456 INFO [train.py:996] (0/4) Epoch 7, batch 19200, loss[loss=0.2651, simple_loss=0.3642, pruned_loss=0.08298, over 21713.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3074, pruned_loss=0.07596, over 4271167.93 frames. 
], batch size: 351, lr: 4.27e-03, grad_scale: 32.0 2023-06-25 00:08:23,643 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.016e+02 3.201e+02 4.532e+02 8.099e+02 1.362e+03, threshold=9.063e+02, percent-clipped=31.0 2023-06-25 00:08:48,662 INFO [train.py:996] (0/4) Epoch 7, batch 19250, loss[loss=0.1673, simple_loss=0.2503, pruned_loss=0.0421, over 21382.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.3074, pruned_loss=0.07101, over 4274282.58 frames. ], batch size: 131, lr: 4.27e-03, grad_scale: 32.0 2023-06-25 00:10:29,817 INFO [train.py:996] (0/4) Epoch 7, batch 19300, loss[loss=0.2314, simple_loss=0.293, pruned_loss=0.08484, over 21551.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.3048, pruned_loss=0.07032, over 4273188.69 frames. ], batch size: 548, lr: 4.27e-03, grad_scale: 16.0 2023-06-25 00:11:00,576 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1213662.0, ans=0.1 2023-06-25 00:11:37,665 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1213782.0, ans=0.125 2023-06-25 00:11:48,340 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1213782.0, ans=0.5 2023-06-25 00:12:02,043 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.816e+02 2.613e+02 3.067e+02 3.986e+02 9.865e+02, threshold=6.134e+02, percent-clipped=1.0 2023-06-25 00:12:24,976 INFO [train.py:996] (0/4) Epoch 7, batch 19350, loss[loss=0.1922, simple_loss=0.2764, pruned_loss=0.05404, over 21626.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2991, pruned_loss=0.06712, over 4273599.95 frames. ], batch size: 263, lr: 4.27e-03, grad_scale: 16.0 2023-06-25 00:12:41,394 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1213902.0, ans=0.0 2023-06-25 00:12:46,164 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1213962.0, ans=0.1 2023-06-25 00:13:19,649 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1214022.0, ans=0.125 2023-06-25 00:13:48,961 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1214142.0, ans=0.0 2023-06-25 00:14:11,257 INFO [train.py:996] (0/4) Epoch 7, batch 19400, loss[loss=0.2017, simple_loss=0.2726, pruned_loss=0.06546, over 21679.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2958, pruned_loss=0.06609, over 4278415.49 frames. ], batch size: 230, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:14:14,052 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1214202.0, ans=0.125 2023-06-25 00:14:14,591 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.60 vs. 
limit=15.0 2023-06-25 00:14:25,699 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1214202.0, ans=0.04949747468305833 2023-06-25 00:15:21,958 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1214382.0, ans=0.2 2023-06-25 00:15:34,735 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.117e+02 2.871e+02 3.427e+02 4.239e+02 8.208e+02, threshold=6.853e+02, percent-clipped=6.0 2023-06-25 00:15:58,306 INFO [train.py:996] (0/4) Epoch 7, batch 19450, loss[loss=0.2181, simple_loss=0.2824, pruned_loss=0.07696, over 21248.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2929, pruned_loss=0.06753, over 4277629.15 frames. ], batch size: 159, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:16:26,318 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1214562.0, ans=0.125 2023-06-25 00:16:54,597 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1214622.0, ans=0.0 2023-06-25 00:17:20,607 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1214682.0, ans=0.05 2023-06-25 00:17:44,035 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1214742.0, ans=0.0 2023-06-25 00:17:46,968 INFO [train.py:996] (0/4) Epoch 7, batch 19500, loss[loss=0.2872, simple_loss=0.3519, pruned_loss=0.1113, over 21440.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2899, pruned_loss=0.06882, over 4276092.10 frames. ], batch size: 507, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:18:03,624 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1214802.0, ans=0.125 2023-06-25 00:18:05,207 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1214802.0, ans=0.0 2023-06-25 00:18:36,123 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1214922.0, ans=0.0 2023-06-25 00:18:37,644 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1214922.0, ans=0.125 2023-06-25 00:19:14,505 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.308e+02 2.919e+02 3.343e+02 4.176e+02 7.589e+02, threshold=6.686e+02, percent-clipped=2.0 2023-06-25 00:19:16,991 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.67 vs. limit=15.0 2023-06-25 00:19:36,584 INFO [train.py:996] (0/4) Epoch 7, batch 19550, loss[loss=0.2, simple_loss=0.3003, pruned_loss=0.04983, over 21758.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2867, pruned_loss=0.06815, over 4274567.54 frames. ], batch size: 298, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:19:58,481 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.02 vs. limit=6.0 2023-06-25 00:20:50,027 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1215282.0, ans=0.015 2023-06-25 00:21:26,535 INFO [train.py:996] (0/4) Epoch 7, batch 19600, loss[loss=0.2197, simple_loss=0.2885, pruned_loss=0.07548, over 21807.00 frames. 
], tot_loss[loss=0.2122, simple_loss=0.2882, pruned_loss=0.06812, over 4282350.11 frames. ], batch size: 298, lr: 4.26e-03, grad_scale: 32.0 2023-06-25 00:21:59,237 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.43 vs. limit=15.0 2023-06-25 00:22:28,172 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1215522.0, ans=0.125 2023-06-25 00:22:52,380 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.425e+02 3.092e+02 3.648e+02 4.642e+02 7.608e+02, threshold=7.295e+02, percent-clipped=3.0 2023-06-25 00:23:21,465 INFO [train.py:996] (0/4) Epoch 7, batch 19650, loss[loss=0.2239, simple_loss=0.3044, pruned_loss=0.07174, over 20005.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2942, pruned_loss=0.07253, over 4282409.38 frames. ], batch size: 702, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:23:25,555 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1215702.0, ans=0.125 2023-06-25 00:24:11,711 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1215822.0, ans=0.1 2023-06-25 00:24:17,448 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1215822.0, ans=0.1 2023-06-25 00:24:56,147 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1215942.0, ans=0.035 2023-06-25 00:25:16,213 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1215942.0, ans=0.0 2023-06-25 00:25:19,704 INFO [train.py:996] (0/4) Epoch 7, batch 19700, loss[loss=0.2159, simple_loss=0.2965, pruned_loss=0.06763, over 21637.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.2962, pruned_loss=0.07321, over 4277762.77 frames. ], batch size: 247, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:25:48,760 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.56 vs. limit=15.0 2023-06-25 00:25:58,881 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1216062.0, ans=0.125 2023-06-25 00:26:54,000 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.233e+02 3.060e+02 3.533e+02 4.552e+02 9.773e+02, threshold=7.066e+02, percent-clipped=3.0 2023-06-25 00:27:15,067 INFO [train.py:996] (0/4) Epoch 7, batch 19750, loss[loss=0.2092, simple_loss=0.29, pruned_loss=0.06417, over 21394.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.3056, pruned_loss=0.0747, over 4269707.58 frames. ], batch size: 131, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:28:40,600 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.91 vs. limit=15.0 2023-06-25 00:28:52,638 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1216542.0, ans=10.0 2023-06-25 00:29:02,197 INFO [train.py:996] (0/4) Epoch 7, batch 19800, loss[loss=0.1893, simple_loss=0.2547, pruned_loss=0.06196, over 21304.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3035, pruned_loss=0.07451, over 4278192.43 frames. 
], batch size: 159, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:29:10,251 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1216602.0, ans=0.04949747468305833 2023-06-25 00:29:20,254 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.20 vs. limit=15.0 2023-06-25 00:29:31,332 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1216662.0, ans=0.1 2023-06-25 00:30:13,975 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1216782.0, ans=0.1 2023-06-25 00:30:15,893 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1216782.0, ans=0.125 2023-06-25 00:30:19,236 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1216782.0, ans=0.125 2023-06-25 00:30:30,868 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.982e+02 2.745e+02 3.353e+02 4.359e+02 1.129e+03, threshold=6.706e+02, percent-clipped=10.0 2023-06-25 00:30:52,374 INFO [train.py:996] (0/4) Epoch 7, batch 19850, loss[loss=0.184, simple_loss=0.2727, pruned_loss=0.04764, over 21597.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2965, pruned_loss=0.06961, over 4273283.55 frames. ], batch size: 230, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:31:00,455 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.05 vs. limit=15.0 2023-06-25 00:31:32,503 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.50 vs. limit=10.0 2023-06-25 00:31:37,510 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.38 vs. limit=6.0 2023-06-25 00:32:02,681 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1217082.0, ans=0.0 2023-06-25 00:32:33,060 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1217142.0, ans=0.125 2023-06-25 00:32:39,664 INFO [train.py:996] (0/4) Epoch 7, batch 19900, loss[loss=0.2343, simple_loss=0.3, pruned_loss=0.08428, over 21438.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2963, pruned_loss=0.06709, over 4270594.45 frames. ], batch size: 507, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:32:44,496 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=25.20 vs. 
limit=22.5 2023-06-25 00:32:49,476 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1217202.0, ans=0.1 2023-06-25 00:33:14,088 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1217262.0, ans=0.125 2023-06-25 00:34:12,744 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.059e+02 2.818e+02 3.439e+02 4.122e+02 9.461e+02, threshold=6.879e+02, percent-clipped=3.0 2023-06-25 00:34:28,700 INFO [train.py:996] (0/4) Epoch 7, batch 19950, loss[loss=0.2134, simple_loss=0.2841, pruned_loss=0.07131, over 21749.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2911, pruned_loss=0.06694, over 4269062.28 frames. ], batch size: 102, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:35:25,613 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1217622.0, ans=0.05 2023-06-25 00:35:58,647 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1217742.0, ans=0.0 2023-06-25 00:36:17,074 INFO [train.py:996] (0/4) Epoch 7, batch 20000, loss[loss=0.2092, simple_loss=0.2858, pruned_loss=0.06627, over 21514.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2912, pruned_loss=0.06734, over 4252633.76 frames. ], batch size: 195, lr: 4.26e-03, grad_scale: 32.0 2023-06-25 00:36:25,036 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1217802.0, ans=0.0 2023-06-25 00:36:56,046 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1217862.0, ans=0.07 2023-06-25 00:37:35,145 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1217982.0, ans=0.1 2023-06-25 00:37:47,459 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.230e+02 2.923e+02 3.292e+02 4.012e+02 7.608e+02, threshold=6.584e+02, percent-clipped=1.0 2023-06-25 00:38:03,226 INFO [train.py:996] (0/4) Epoch 7, batch 20050, loss[loss=0.2173, simple_loss=0.29, pruned_loss=0.07231, over 21259.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2936, pruned_loss=0.06997, over 4265558.55 frames. ], batch size: 159, lr: 4.26e-03, grad_scale: 32.0 2023-06-25 00:38:13,860 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1218102.0, ans=0.125 2023-06-25 00:38:51,091 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1218222.0, ans=0.0 2023-06-25 00:39:29,066 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1218282.0, ans=0.125 2023-06-25 00:39:29,098 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 00:39:53,789 INFO [train.py:996] (0/4) Epoch 7, batch 20100, loss[loss=0.2372, simple_loss=0.3168, pruned_loss=0.07873, over 21817.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.296, pruned_loss=0.07255, over 4275414.99 frames. 
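The ScheduledFloat entries (dropout_p, skip_rate, scale_min, ...) report a value ("ans") tied to batch_count. A minimal sketch of a batch-count-driven piecewise-linear schedule is below; the breakpoints are made up for illustration, and the real schedules live in the recipe's scaling module.

# --- illustrative sketch ---
def scheduled_float(batch_count, schedule):
    """schedule: sorted list of (batch_count, value) breakpoints."""
    if batch_count <= schedule[0][0]:
        return schedule[0][1]
    for (x0, y0), (x1, y1) in zip(schedule, schedule[1:]):
        if batch_count <= x1:
            t = (batch_count - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)
    return schedule[-1][1]

# Hypothetical skip-rate schedule: decay from 0.5 to 0.05 over the first 20k batches,
# then stay flat; at the batch_count values logged above it has long since hit the floor.
print(scheduled_float(1214922.0, [(0.0, 0.5), (20000.0, 0.05)]))  # -> 0.05
# --- end sketch ---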
], batch size: 298, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:40:34,022 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1218462.0, ans=0.05 2023-06-25 00:40:40,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1218462.0, ans=0.125 2023-06-25 00:40:40,916 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1218462.0, ans=0.125 2023-06-25 00:41:05,255 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1218522.0, ans=0.125 2023-06-25 00:41:29,516 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.560e+02 2.968e+02 3.649e+02 4.781e+02 8.701e+02, threshold=7.299e+02, percent-clipped=5.0 2023-06-25 00:41:49,534 INFO [train.py:996] (0/4) Epoch 7, batch 20150, loss[loss=0.2021, simple_loss=0.2495, pruned_loss=0.0774, over 20357.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.3038, pruned_loss=0.07569, over 4275232.31 frames. ], batch size: 703, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:41:54,500 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.55 vs. limit=15.0 2023-06-25 00:42:09,248 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1218702.0, ans=0.0 2023-06-25 00:42:22,048 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1218762.0, ans=0.1 2023-06-25 00:42:27,646 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1218762.0, ans=0.1 2023-06-25 00:42:50,460 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.08 vs. limit=15.0 2023-06-25 00:43:51,477 INFO [train.py:996] (0/4) Epoch 7, batch 20200, loss[loss=0.2569, simple_loss=0.3664, pruned_loss=0.07369, over 20755.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3093, pruned_loss=0.0786, over 4276329.41 frames. ], batch size: 607, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:44:09,535 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.92 vs. limit=15.0 2023-06-25 00:45:22,368 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.473e+02 3.331e+02 3.947e+02 5.099e+02 9.386e+02, threshold=7.894e+02, percent-clipped=7.0 2023-06-25 00:45:36,279 INFO [train.py:996] (0/4) Epoch 7, batch 20250, loss[loss=0.2264, simple_loss=0.2997, pruned_loss=0.07659, over 21328.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3103, pruned_loss=0.07725, over 4275412.48 frames. ], batch size: 176, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:47:08,290 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1219542.0, ans=0.125 2023-06-25 00:47:25,051 INFO [train.py:996] (0/4) Epoch 7, batch 20300, loss[loss=0.2201, simple_loss=0.3135, pruned_loss=0.06335, over 21610.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3087, pruned_loss=0.07471, over 4274986.24 frames. 
], batch size: 389, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:47:32,226 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1219602.0, ans=0.125 2023-06-25 00:47:47,396 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1219662.0, ans=0.0 2023-06-25 00:48:02,211 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.68 vs. limit=15.0 2023-06-25 00:48:10,957 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.15 vs. limit=15.0 2023-06-25 00:48:24,352 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1219782.0, ans=0.0 2023-06-25 00:48:52,980 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.137e+02 2.615e+02 3.044e+02 3.787e+02 8.411e+02, threshold=6.088e+02, percent-clipped=1.0 2023-06-25 00:49:06,426 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.25 vs. limit=15.0 2023-06-25 00:49:06,542 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.78 vs. limit=10.0 2023-06-25 00:49:11,901 INFO [train.py:996] (0/4) Epoch 7, batch 20350, loss[loss=0.2273, simple_loss=0.3058, pruned_loss=0.07438, over 21857.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3076, pruned_loss=0.07411, over 4267572.75 frames. ], batch size: 351, lr: 4.25e-03, grad_scale: 16.0 2023-06-25 00:49:18,242 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1219902.0, ans=0.125 2023-06-25 00:49:33,771 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1219962.0, ans=0.0 2023-06-25 00:50:05,524 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1220022.0, ans=0.04949747468305833 2023-06-25 00:50:26,240 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1220082.0, ans=0.0 2023-06-25 00:50:27,714 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1220082.0, ans=0.1 2023-06-25 00:50:29,714 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.17 vs. limit=10.0 2023-06-25 00:50:56,364 INFO [train.py:996] (0/4) Epoch 7, batch 20400, loss[loss=0.2367, simple_loss=0.3425, pruned_loss=0.06542, over 19861.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3106, pruned_loss=0.07655, over 4261297.49 frames. 
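The "Whitening: ... metric=X vs. limit=Y" lines compare a per-module activation statistic against a limit. The sketch below assumes the metric measures how far the activation covariance is from a scaled identity (about 1.0 for perfectly whitened features, larger as the covariance becomes anisotropic); the exact definition used by the recipe may differ.

# --- illustrative sketch ---
import torch

def whitening_metric(x, num_groups):
    # x: (num_frames, num_channels); channels are split into num_groups groups.
    n, c = x.shape
    g = c // num_groups
    xg = x.reshape(n, num_groups, g).transpose(0, 1)      # (groups, frames, g)
    cov = torch.matmul(xg.transpose(1, 2), xg) / n         # per-group covariance, (groups, g, g)
    mean_diag = cov.diagonal(dim1=1, dim2=2).mean(dim=1)   # average variance per group
    mean_sq = (cov ** 2).mean(dim=(1, 2))                  # average squared covariance entry
    return float((g * mean_sq / (mean_diag ** 2 + 1e-20)).mean())

print(whitening_metric(torch.randn(2000, 256), num_groups=1))  # ~1.0 for white Gaussian input
# --- end sketch ---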
], batch size: 704, lr: 4.25e-03, grad_scale: 32.0 2023-06-25 00:51:12,875 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1220202.0, ans=0.125 2023-06-25 00:51:21,911 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1220262.0, ans=0.1 2023-06-25 00:51:21,981 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1220262.0, ans=0.0 2023-06-25 00:51:49,596 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1220322.0, ans=0.125 2023-06-25 00:51:53,147 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1220382.0, ans=0.125 2023-06-25 00:51:56,362 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1220382.0, ans=0.125 2023-06-25 00:52:09,593 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1220382.0, ans=0.125 2023-06-25 00:52:32,835 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.319e+02 3.347e+02 3.963e+02 4.819e+02 8.468e+02, threshold=7.927e+02, percent-clipped=6.0 2023-06-25 00:52:36,838 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1220442.0, ans=0.125 2023-06-25 00:52:38,958 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.72 vs. limit=22.5 2023-06-25 00:52:44,836 INFO [train.py:996] (0/4) Epoch 7, batch 20450, loss[loss=0.2499, simple_loss=0.3098, pruned_loss=0.09495, over 21924.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3112, pruned_loss=0.07872, over 4261168.48 frames. ], batch size: 113, lr: 4.25e-03, grad_scale: 16.0 2023-06-25 00:53:02,235 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1220502.0, ans=0.125 2023-06-25 00:54:25,818 INFO [train.py:996] (0/4) Epoch 7, batch 20500, loss[loss=0.2064, simple_loss=0.2656, pruned_loss=0.07358, over 21375.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3079, pruned_loss=0.07886, over 4247734.75 frames. ], batch size: 548, lr: 4.25e-03, grad_scale: 16.0 2023-06-25 00:54:32,890 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1220802.0, ans=10.0 2023-06-25 00:55:17,697 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.13 vs. 
limit=10.0 2023-06-25 00:55:26,255 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1220982.0, ans=0.125 2023-06-25 00:55:37,788 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1220982.0, ans=0.2 2023-06-25 00:56:00,856 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.376e+02 3.204e+02 4.054e+02 5.426e+02 8.867e+02, threshold=8.109e+02, percent-clipped=2.0 2023-06-25 00:56:01,453 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=1221042.0, ans=0.05 2023-06-25 00:56:13,073 INFO [train.py:996] (0/4) Epoch 7, batch 20550, loss[loss=0.1813, simple_loss=0.2565, pruned_loss=0.05305, over 16153.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.3012, pruned_loss=0.07694, over 4232480.82 frames. ], batch size: 60, lr: 4.25e-03, grad_scale: 16.0 2023-06-25 00:56:47,663 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.90 vs. limit=15.0 2023-06-25 00:56:55,137 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.57 vs. limit=15.0 2023-06-25 00:57:10,249 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1221282.0, ans=0.1 2023-06-25 00:57:36,394 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1221282.0, ans=0.1 2023-06-25 00:57:56,532 INFO [train.py:996] (0/4) Epoch 7, batch 20600, loss[loss=0.2317, simple_loss=0.3205, pruned_loss=0.07144, over 16853.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.3019, pruned_loss=0.0752, over 4214498.98 frames. ], batch size: 60, lr: 4.25e-03, grad_scale: 16.0 2023-06-25 00:58:00,863 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1221402.0, ans=0.1 2023-06-25 00:58:12,772 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1221402.0, ans=0.04949747468305833 2023-06-25 00:58:39,285 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1221522.0, ans=0.09899494936611666 2023-06-25 00:59:25,531 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.194e+02 3.095e+02 3.828e+02 5.103e+02 1.106e+03, threshold=7.656e+02, percent-clipped=7.0 2023-06-25 00:59:37,774 INFO [train.py:996] (0/4) Epoch 7, batch 20650, loss[loss=0.2173, simple_loss=0.2936, pruned_loss=0.07053, over 21789.00 frames. ], tot_loss[loss=0.225, simple_loss=0.299, pruned_loss=0.0755, over 4236212.08 frames. 
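Several of the scheduled values above are skip rates (bypass.skip_rate, attention_skip_rate, conv_skip_rate, ff2/ff3_skip_rate), which suggests sub-modules are stochastically skipped during training. A minimal sketch of that idea, assuming a sub-module's contribution is simply dropped with probability skip_rate while training:

# --- illustrative sketch ---
import torch

def residual_with_skip(residual, module_out, skip_rate, training):
    """Drop a sub-module's output with probability skip_rate (training only)."""
    if training and float(torch.rand(())) < skip_rate:
        return residual                  # module skipped for this batch
    return residual + module_out         # ordinary residual connection
# --- end sketch ---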
], batch size: 351, lr: 4.25e-03, grad_scale: 16.0 2023-06-25 00:59:50,164 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1221702.0, ans=0.0 2023-06-25 01:00:25,323 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1221822.0, ans=0.125 2023-06-25 01:01:20,267 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1221942.0, ans=0.0 2023-06-25 01:01:20,297 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1221942.0, ans=0.125 2023-06-25 01:01:32,860 INFO [train.py:996] (0/4) Epoch 7, batch 20700, loss[loss=0.3023, simple_loss=0.3801, pruned_loss=0.1122, over 21491.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2929, pruned_loss=0.07275, over 4248570.94 frames. ], batch size: 508, lr: 4.25e-03, grad_scale: 16.0 2023-06-25 01:01:47,368 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1222002.0, ans=0.0 2023-06-25 01:01:52,337 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1222062.0, ans=0.0 2023-06-25 01:02:58,572 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1222182.0, ans=0.0 2023-06-25 01:03:06,667 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.957e+02 2.936e+02 3.801e+02 5.565e+02 1.085e+03, threshold=7.602e+02, percent-clipped=14.0 2023-06-25 01:03:24,065 INFO [train.py:996] (0/4) Epoch 7, batch 20750, loss[loss=0.2279, simple_loss=0.3158, pruned_loss=0.06999, over 21587.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2941, pruned_loss=0.07168, over 4258425.97 frames. ], batch size: 230, lr: 4.25e-03, grad_scale: 16.0 2023-06-25 01:03:49,725 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1222362.0, ans=0.0 2023-06-25 01:03:52,784 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1222362.0, ans=0.2 2023-06-25 01:04:19,591 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1222422.0, ans=0.125 2023-06-25 01:04:38,737 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1222482.0, ans=0.1 2023-06-25 01:04:54,350 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1222542.0, ans=0.125 2023-06-25 01:05:07,444 INFO [train.py:996] (0/4) Epoch 7, batch 20800, loss[loss=0.2207, simple_loss=0.3023, pruned_loss=0.06958, over 21647.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2989, pruned_loss=0.07302, over 4263744.20 frames. 
], batch size: 332, lr: 4.25e-03, grad_scale: 32.0 2023-06-25 01:05:20,610 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1222602.0, ans=0.0 2023-06-25 01:06:39,392 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1222842.0, ans=0.1 2023-06-25 01:06:43,841 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.363e+02 3.312e+02 4.339e+02 6.808e+02 1.439e+03, threshold=8.678e+02, percent-clipped=19.0 2023-06-25 01:06:47,930 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 01:06:55,802 INFO [train.py:996] (0/4) Epoch 7, batch 20850, loss[loss=0.1816, simple_loss=0.2489, pruned_loss=0.05716, over 21506.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2927, pruned_loss=0.0711, over 4268503.38 frames. ], batch size: 212, lr: 4.25e-03, grad_scale: 32.0 2023-06-25 01:07:23,342 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1222962.0, ans=0.0 2023-06-25 01:07:51,585 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.77 vs. limit=15.0 2023-06-25 01:07:58,684 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1223022.0, ans=0.125 2023-06-25 01:07:59,343 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.11 vs. limit=15.0 2023-06-25 01:08:01,865 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1223022.0, ans=0.125 2023-06-25 01:08:38,268 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1223142.0, ans=0.1 2023-06-25 01:08:44,708 INFO [train.py:996] (0/4) Epoch 7, batch 20900, loss[loss=0.2357, simple_loss=0.3084, pruned_loss=0.08145, over 21869.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2928, pruned_loss=0.07238, over 4281378.44 frames. ], batch size: 124, lr: 4.25e-03, grad_scale: 32.0 2023-06-25 01:08:48,762 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.88 vs. limit=15.0 2023-06-25 01:08:51,578 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1223202.0, ans=0.125 2023-06-25 01:08:56,738 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1223202.0, ans=0.125 2023-06-25 01:09:19,128 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1223262.0, ans=0.07 2023-06-25 01:10:19,931 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.054e+02 2.894e+02 3.467e+02 4.402e+02 7.475e+02, threshold=6.935e+02, percent-clipped=1.0 2023-06-25 01:10:30,296 INFO [train.py:996] (0/4) Epoch 7, batch 20950, loss[loss=0.1778, simple_loss=0.2543, pruned_loss=0.05068, over 20822.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2889, pruned_loss=0.06877, over 4269194.79 frames. 
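The grad_scale column flips between 16.0 and 32.0, which looks like a dynamic fp16 loss scale (use_fp16=True in the hyper-parameter dump, and a grad scaler state dict is loaded at startup). A minimal sketch of that pattern with a standard torch.cuda.amp GradScaler; the mapping onto the recipe's own scaler is an assumption.

# --- illustrative sketch ---
import torch

scaler = torch.cuda.amp.GradScaler(init_scale=32.0)

def train_step(model, optimizer, batch, loss_fn):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = loss_fn(model(batch))
    scaler.scale(loss).backward()
    scaler.step(optimizer)       # skips the update when the scaled grads overflow
    scaler.update()              # halves the scale after an overflow, grows it after stable steps
    return scaler.get_scale()    # the value printed as grad_scale
# --- end sketch ---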
], batch size: 608, lr: 4.25e-03, grad_scale: 16.0 2023-06-25 01:11:52,704 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1223742.0, ans=0.125 2023-06-25 01:12:00,756 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1223742.0, ans=0.0 2023-06-25 01:12:09,747 INFO [train.py:996] (0/4) Epoch 7, batch 21000, loss[loss=0.2256, simple_loss=0.3073, pruned_loss=0.07195, over 21830.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2872, pruned_loss=0.069, over 4281029.47 frames. ], batch size: 282, lr: 4.25e-03, grad_scale: 16.0 2023-06-25 01:12:09,748 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-25 01:12:27,621 INFO [train.py:1028] (0/4) Epoch 7, validation: loss=0.2666, simple_loss=0.3633, pruned_loss=0.08493, over 1796401.00 frames. 2023-06-25 01:12:27,621 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23616MB 2023-06-25 01:12:46,639 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1223862.0, ans=0.1 2023-06-25 01:13:18,382 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1223922.0, ans=0.07 2023-06-25 01:13:21,854 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1223922.0, ans=0.125 2023-06-25 01:13:25,235 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1223922.0, ans=0.125 2023-06-25 01:13:33,604 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1223922.0, ans=0.125 2023-06-25 01:13:35,117 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1223982.0, ans=0.0 2023-06-25 01:13:36,561 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1223982.0, ans=0.1 2023-06-25 01:13:39,522 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-204000.pt 2023-06-25 01:13:47,268 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.44 vs. limit=15.0 2023-06-25 01:14:00,037 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1224042.0, ans=0.0 2023-06-25 01:14:00,189 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1224042.0, ans=0.0 2023-06-25 01:14:05,669 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1224042.0, ans=0.125 2023-06-25 01:14:06,890 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.137e+02 2.703e+02 3.087e+02 3.976e+02 6.503e+02, threshold=6.174e+02, percent-clipped=0.0 2023-06-25 01:14:17,199 INFO [train.py:996] (0/4) Epoch 7, batch 21050, loss[loss=0.2107, simple_loss=0.2781, pruned_loss=0.07161, over 21256.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2848, pruned_loss=0.06922, over 4286122.08 frames. 
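The "Computing validation loss" / "Maximum memory allocated" block and the "Saving checkpoint to zipformer/exp_L_small/checkpoint-204000.pt" line follow the cadence set by valid_interval and save_every_n in the hyper-parameter dump (3000 and 4000 batches; 204000 is a multiple of 4000). A sketch of that cadence; the helper names below are hypothetical.

# --- illustrative sketch ---
import torch

VALID_INTERVAL = 3000   # from the hyper-parameter dump
SAVE_EVERY_N = 4000     # from the hyper-parameter dump

def maybe_validate_and_save(model, batch_idx_train, exp_dir, compute_validation_loss):
    if batch_idx_train % VALID_INTERVAL == 0:
        valid_loss = compute_validation_loss(model)             # hypothetical helper
        max_mem_mb = torch.cuda.max_memory_allocated() // 2**20
        print(f"validation: loss={valid_loss:.4f}")
        print(f"Maximum memory allocated so far is {max_mem_mb}MB")
    if batch_idx_train % SAVE_EVERY_N == 0:
        torch.save({"model": model.state_dict(),
                    "batch_idx_train": batch_idx_train},
                   f"{exp_dir}/checkpoint-{batch_idx_train}.pt")
# --- end sketch ---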
], batch size: 159, lr: 4.25e-03, grad_scale: 16.0 2023-06-25 01:15:24,495 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1224282.0, ans=0.125 2023-06-25 01:16:05,182 INFO [train.py:996] (0/4) Epoch 7, batch 21100, loss[loss=0.2067, simple_loss=0.2761, pruned_loss=0.06868, over 21311.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2812, pruned_loss=0.06841, over 4271509.42 frames. ], batch size: 177, lr: 4.25e-03, grad_scale: 16.0 2023-06-25 01:16:05,720 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1224402.0, ans=0.0 2023-06-25 01:16:51,849 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1224522.0, ans=0.0 2023-06-25 01:17:22,199 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.34 vs. limit=15.0 2023-06-25 01:17:42,071 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.130e+02 2.657e+02 3.143e+02 4.101e+02 9.163e+02, threshold=6.287e+02, percent-clipped=4.0 2023-06-25 01:17:50,605 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.99 vs. limit=15.0 2023-06-25 01:17:52,643 INFO [train.py:996] (0/4) Epoch 7, batch 21150, loss[loss=0.2013, simple_loss=0.264, pruned_loss=0.06927, over 21772.00 frames. ], tot_loss[loss=0.2074, simple_loss=0.2777, pruned_loss=0.06849, over 4258195.47 frames. ], batch size: 317, lr: 4.25e-03, grad_scale: 16.0 2023-06-25 01:19:39,297 INFO [train.py:996] (0/4) Epoch 7, batch 21200, loss[loss=0.191, simple_loss=0.2269, pruned_loss=0.07749, over 20108.00 frames. ], tot_loss[loss=0.2048, simple_loss=0.2736, pruned_loss=0.06799, over 4256127.85 frames. ], batch size: 703, lr: 4.25e-03, grad_scale: 32.0 2023-06-25 01:19:52,666 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1225002.0, ans=0.125 2023-06-25 01:20:23,013 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.28 vs. limit=22.5 2023-06-25 01:20:51,504 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1225182.0, ans=0.125 2023-06-25 01:21:17,799 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.018e+02 2.659e+02 3.125e+02 3.870e+02 6.186e+02, threshold=6.250e+02, percent-clipped=0.0 2023-06-25 01:21:18,921 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.76 vs. limit=15.0 2023-06-25 01:21:28,375 INFO [train.py:996] (0/4) Epoch 7, batch 21250, loss[loss=0.1755, simple_loss=0.2479, pruned_loss=0.05155, over 21186.00 frames. ], tot_loss[loss=0.2031, simple_loss=0.2716, pruned_loss=0.06735, over 4248239.69 frames. ], batch size: 176, lr: 4.25e-03, grad_scale: 32.0 2023-06-25 01:21:32,959 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.79 vs. limit=22.5 2023-06-25 01:22:30,387 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.33 vs. 
limit=10.0 2023-06-25 01:23:02,265 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1225542.0, ans=0.125 2023-06-25 01:23:05,282 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1225542.0, ans=0.125 2023-06-25 01:23:15,843 INFO [train.py:996] (0/4) Epoch 7, batch 21300, loss[loss=0.2279, simple_loss=0.3035, pruned_loss=0.07613, over 21846.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2798, pruned_loss=0.06932, over 4252696.35 frames. ], batch size: 391, lr: 4.25e-03, grad_scale: 32.0 2023-06-25 01:24:17,066 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=5.82 vs. limit=22.5 2023-06-25 01:24:50,845 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1225842.0, ans=0.125 2023-06-25 01:24:55,313 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.338e+02 2.894e+02 3.300e+02 4.575e+02 9.382e+02, threshold=6.600e+02, percent-clipped=9.0 2023-06-25 01:25:02,057 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.79 vs. limit=22.5 2023-06-25 01:25:04,034 INFO [train.py:996] (0/4) Epoch 7, batch 21350, loss[loss=0.1925, simple_loss=0.2881, pruned_loss=0.04852, over 21822.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2835, pruned_loss=0.07001, over 4251515.07 frames. ], batch size: 316, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:25:15,127 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1225902.0, ans=0.0 2023-06-25 01:25:18,836 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.79 vs. limit=15.0 2023-06-25 01:25:42,022 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1225962.0, ans=0.125 2023-06-25 01:25:48,357 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.98 vs. limit=15.0 2023-06-25 01:26:35,641 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.98 vs. limit=15.0 2023-06-25 01:26:51,942 INFO [train.py:996] (0/4) Epoch 7, batch 21400, loss[loss=0.1925, simple_loss=0.2641, pruned_loss=0.06045, over 21683.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2868, pruned_loss=0.0702, over 4251617.18 frames. ], batch size: 112, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:26:52,602 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 01:26:58,400 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.49 vs. limit=22.5 2023-06-25 01:27:34,614 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.37 vs. limit=15.0 2023-06-25 01:27:50,867 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.61 vs. 
limit=15.0 2023-06-25 01:27:59,442 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.38 vs. limit=12.0 2023-06-25 01:28:02,763 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.96 vs. limit=15.0 2023-06-25 01:28:04,549 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.02 vs. limit=15.0 2023-06-25 01:28:05,990 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1226382.0, ans=0.0 2023-06-25 01:28:31,716 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.113e+02 3.088e+02 4.012e+02 5.119e+02 7.296e+02, threshold=8.024e+02, percent-clipped=4.0 2023-06-25 01:28:40,335 INFO [train.py:996] (0/4) Epoch 7, batch 21450, loss[loss=0.2458, simple_loss=0.3135, pruned_loss=0.08903, over 21890.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2907, pruned_loss=0.07186, over 4264337.42 frames. ], batch size: 124, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:29:40,963 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.14 vs. limit=15.0 2023-06-25 01:29:48,801 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1226682.0, ans=0.125 2023-06-25 01:30:01,458 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.03 vs. limit=15.0 2023-06-25 01:30:04,412 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1226682.0, ans=0.09899494936611666 2023-06-25 01:30:17,306 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.03 vs. limit=15.0 2023-06-25 01:30:28,718 INFO [train.py:996] (0/4) Epoch 7, batch 21500, loss[loss=0.2611, simple_loss=0.2982, pruned_loss=0.112, over 21538.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2901, pruned_loss=0.07274, over 4273290.90 frames. ], batch size: 511, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:31:20,636 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1226922.0, ans=0.2 2023-06-25 01:31:51,232 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1226982.0, ans=0.0 2023-06-25 01:32:06,008 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.254e+02 2.889e+02 3.383e+02 4.228e+02 8.142e+02, threshold=6.766e+02, percent-clipped=1.0 2023-06-25 01:32:14,667 INFO [train.py:996] (0/4) Epoch 7, batch 21550, loss[loss=0.3009, simple_loss=0.4286, pruned_loss=0.08661, over 19692.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2841, pruned_loss=0.07045, over 4256649.94 frames. ], batch size: 702, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:32:15,786 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.33 vs. 
limit=15.0 2023-06-25 01:33:03,232 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1227222.0, ans=0.0 2023-06-25 01:33:36,638 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1227282.0, ans=0.0 2023-06-25 01:33:50,749 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1227342.0, ans=0.0 2023-06-25 01:33:58,114 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1227402.0, ans=0.125 2023-06-25 01:33:59,009 INFO [train.py:996] (0/4) Epoch 7, batch 21600, loss[loss=0.2026, simple_loss=0.2658, pruned_loss=0.06972, over 21828.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2787, pruned_loss=0.06885, over 4256618.96 frames. ], batch size: 352, lr: 4.24e-03, grad_scale: 32.0 2023-06-25 01:34:23,003 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1227402.0, ans=0.0 2023-06-25 01:34:30,591 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1227462.0, ans=6.0 2023-06-25 01:35:06,564 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1227522.0, ans=0.125 2023-06-25 01:35:40,097 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.017e+02 2.809e+02 3.415e+02 4.856e+02 1.279e+03, threshold=6.830e+02, percent-clipped=8.0 2023-06-25 01:35:47,646 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1227702.0, ans=0.125 2023-06-25 01:35:53,454 INFO [train.py:996] (0/4) Epoch 7, batch 21650, loss[loss=0.2025, simple_loss=0.3062, pruned_loss=0.04939, over 21799.00 frames. ], tot_loss[loss=0.2064, simple_loss=0.2804, pruned_loss=0.06625, over 4258344.67 frames. ], batch size: 316, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:35:53,997 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1227702.0, ans=0.2 2023-06-25 01:36:08,551 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1227702.0, ans=0.0 2023-06-25 01:37:08,042 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1227882.0, ans=0.125 2023-06-25 01:37:26,613 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1227942.0, ans=0.0 2023-06-25 01:37:34,948 INFO [train.py:996] (0/4) Epoch 7, batch 21700, loss[loss=0.2047, simple_loss=0.2964, pruned_loss=0.05644, over 21773.00 frames. ], tot_loss[loss=0.2058, simple_loss=0.282, pruned_loss=0.06477, over 4258319.97 frames. ], batch size: 298, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:37:35,522 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1228002.0, ans=0.125 2023-06-25 01:38:07,644 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1228062.0, ans=0.2 2023-06-25 01:38:24,062 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.67 vs. 
limit=15.0 2023-06-25 01:38:48,064 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.43 vs. limit=12.0 2023-06-25 01:38:53,819 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1228182.0, ans=0.125 2023-06-25 01:38:57,268 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1228182.0, ans=0.125 2023-06-25 01:39:14,327 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.145e+02 3.013e+02 3.692e+02 5.814e+02 1.203e+03, threshold=7.384e+02, percent-clipped=13.0 2023-06-25 01:39:20,992 INFO [train.py:996] (0/4) Epoch 7, batch 21750, loss[loss=0.184, simple_loss=0.2457, pruned_loss=0.06114, over 21297.00 frames. ], tot_loss[loss=0.2039, simple_loss=0.2776, pruned_loss=0.06506, over 4246266.70 frames. ], batch size: 160, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:40:11,180 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1228422.0, ans=0.125 2023-06-25 01:40:12,643 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1228422.0, ans=0.125 2023-06-25 01:40:14,616 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1228422.0, ans=0.0 2023-06-25 01:40:16,160 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1228422.0, ans=0.125 2023-06-25 01:40:30,389 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1228422.0, ans=0.0 2023-06-25 01:40:37,273 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 01:41:04,206 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1228542.0, ans=0.125 2023-06-25 01:41:08,625 INFO [train.py:996] (0/4) Epoch 7, batch 21800, loss[loss=0.2004, simple_loss=0.305, pruned_loss=0.04786, over 20805.00 frames. ], tot_loss[loss=0.2052, simple_loss=0.2786, pruned_loss=0.06593, over 4248885.40 frames. ], batch size: 607, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:41:09,933 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.06 vs. limit=6.0 2023-06-25 01:41:36,061 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1228662.0, ans=0.0 2023-06-25 01:41:41,123 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1228662.0, ans=0.125 2023-06-25 01:41:47,124 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.52 vs. 
limit=6.0 2023-06-25 01:41:57,195 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1228722.0, ans=10.0 2023-06-25 01:42:45,926 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.478e+02 3.194e+02 4.069e+02 5.190e+02 9.750e+02, threshold=8.138e+02, percent-clipped=3.0 2023-06-25 01:42:53,048 INFO [train.py:996] (0/4) Epoch 7, batch 21850, loss[loss=0.2127, simple_loss=0.2849, pruned_loss=0.07027, over 21899.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.283, pruned_loss=0.06693, over 4238748.53 frames. ], batch size: 316, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:42:53,644 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1228902.0, ans=0.1 2023-06-25 01:43:20,829 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1228962.0, ans=0.1 2023-06-25 01:43:32,762 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1228962.0, ans=0.0 2023-06-25 01:43:47,511 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.93 vs. limit=6.0 2023-06-25 01:43:55,686 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1229022.0, ans=0.125 2023-06-25 01:43:55,805 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1229022.0, ans=0.0 2023-06-25 01:44:33,846 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.14 vs. limit=10.0 2023-06-25 01:44:44,842 INFO [train.py:996] (0/4) Epoch 7, batch 21900, loss[loss=0.1964, simple_loss=0.2685, pruned_loss=0.06218, over 21660.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2844, pruned_loss=0.06866, over 4246132.71 frames. ], batch size: 263, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:45:00,471 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1229202.0, ans=0.125 2023-06-25 01:45:07,329 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1229262.0, ans=0.125 2023-06-25 01:45:51,176 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.63 vs. limit=15.0 2023-06-25 01:46:19,671 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.343e+02 2.991e+02 3.581e+02 4.789e+02 1.002e+03, threshold=7.161e+02, percent-clipped=1.0 2023-06-25 01:46:31,075 INFO [train.py:996] (0/4) Epoch 7, batch 21950, loss[loss=0.1716, simple_loss=0.2585, pruned_loss=0.04231, over 21535.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2796, pruned_loss=0.06738, over 4241064.49 frames. 
], batch size: 441, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:47:07,539 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1229562.0, ans=0.1 2023-06-25 01:47:17,058 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1229622.0, ans=0.1 2023-06-25 01:47:17,579 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.09 vs. limit=10.0 2023-06-25 01:47:32,566 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1229622.0, ans=0.1 2023-06-25 01:47:45,001 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1229682.0, ans=0.04949747468305833 2023-06-25 01:47:48,385 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1229682.0, ans=0.125 2023-06-25 01:47:59,443 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1229742.0, ans=0.125 2023-06-25 01:48:26,821 INFO [train.py:996] (0/4) Epoch 7, batch 22000, loss[loss=0.1526, simple_loss=0.2302, pruned_loss=0.03749, over 21498.00 frames. ], tot_loss[loss=0.2024, simple_loss=0.2741, pruned_loss=0.06533, over 4238646.17 frames. ], batch size: 195, lr: 4.24e-03, grad_scale: 32.0 2023-06-25 01:49:09,833 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1229862.0, ans=0.0 2023-06-25 01:49:29,359 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1229922.0, ans=0.0 2023-06-25 01:49:46,711 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1229982.0, ans=0.125 2023-06-25 01:50:12,188 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.130e+02 3.193e+02 3.853e+02 5.102e+02 1.201e+03, threshold=7.707e+02, percent-clipped=7.0 2023-06-25 01:50:14,562 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1230042.0, ans=0.025 2023-06-25 01:50:17,732 INFO [train.py:996] (0/4) Epoch 7, batch 22050, loss[loss=0.2001, simple_loss=0.2687, pruned_loss=0.06575, over 21370.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.2798, pruned_loss=0.06691, over 4232631.86 frames. ], batch size: 194, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:50:18,454 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1230102.0, ans=0.125 2023-06-25 01:51:18,329 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.48 vs. 
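The balancer entries (balancer1.prob, conv_module2.balancer1.min_positive=0.025 above, max_abs=10.0 earlier in the log) refer to activation balancing: keeping the fraction of positive activations and their magnitudes inside configured bounds. The sketch below only measures those statistics; the recipe's balancers act by adjusting gradients, which is not reproduced here.

# --- illustrative sketch ---
import torch

def balancer_report(x, min_positive=0.025, max_abs=10.0):
    # x: (num_frames, num_channels). Flag channels whose statistics drift out of bounds.
    frac_positive = (x > 0).float().mean(dim=0)
    rarely_positive = (frac_positive < min_positive).nonzero().flatten()
    too_large = (x.abs().amax(dim=0) > max_abs).nonzero().flatten()
    return rarely_positive, too_large

rarely_positive, too_large = balancer_report(torch.randn(500, 256))
# --- end sketch ---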
limit=15.0 2023-06-25 01:51:19,358 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1230222.0, ans=0.2 2023-06-25 01:51:28,050 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1230282.0, ans=0.125 2023-06-25 01:51:40,436 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.75 vs. limit=15.0 2023-06-25 01:52:06,973 INFO [train.py:996] (0/4) Epoch 7, batch 22100, loss[loss=0.2392, simple_loss=0.3141, pruned_loss=0.08219, over 21762.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.291, pruned_loss=0.0717, over 4236972.29 frames. ], batch size: 332, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:52:07,481 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1230402.0, ans=0.2 2023-06-25 01:52:38,620 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 01:52:47,742 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1230462.0, ans=0.1 2023-06-25 01:52:58,518 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.00 vs. limit=12.0 2023-06-25 01:52:59,519 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1230522.0, ans=0.125 2023-06-25 01:53:49,132 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.658e+02 3.415e+02 4.118e+02 5.475e+02 8.069e+02, threshold=8.235e+02, percent-clipped=4.0 2023-06-25 01:53:53,252 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1230702.0, ans=0.125 2023-06-25 01:53:54,208 INFO [train.py:996] (0/4) Epoch 7, batch 22150, loss[loss=0.2422, simple_loss=0.3064, pruned_loss=0.08897, over 21774.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2943, pruned_loss=0.07281, over 4253206.00 frames. ], batch size: 441, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:54:12,970 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.49 vs. limit=15.0 2023-06-25 01:54:25,075 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1230762.0, ans=0.0 2023-06-25 01:54:38,947 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1230822.0, ans=0.125 2023-06-25 01:54:43,884 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1230822.0, ans=0.1 2023-06-25 01:55:17,684 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1230942.0, ans=0.125 2023-06-25 01:55:41,163 INFO [train.py:996] (0/4) Epoch 7, batch 22200, loss[loss=0.2289, simple_loss=0.3129, pruned_loss=0.07249, over 21362.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2972, pruned_loss=0.07368, over 4268107.98 frames. 
], batch size: 144, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:57:00,407 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.09 vs. limit=10.0 2023-06-25 01:57:03,811 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.13 vs. limit=10.0 2023-06-25 01:57:19,405 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.75 vs. limit=22.5 2023-06-25 01:57:25,287 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.355e+02 3.120e+02 3.891e+02 5.411e+02 1.488e+03, threshold=7.782e+02, percent-clipped=8.0 2023-06-25 01:57:31,146 INFO [train.py:996] (0/4) Epoch 7, batch 22250, loss[loss=0.2537, simple_loss=0.3351, pruned_loss=0.08614, over 21467.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.303, pruned_loss=0.07517, over 4269690.63 frames. ], batch size: 131, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:57:46,020 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.88 vs. limit=6.0 2023-06-25 01:58:18,454 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1231422.0, ans=0.1 2023-06-25 01:58:30,442 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1231422.0, ans=0.125 2023-06-25 01:58:39,526 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.77 vs. limit=6.0 2023-06-25 01:59:15,447 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1231542.0, ans=0.2 2023-06-25 01:59:18,349 INFO [train.py:996] (0/4) Epoch 7, batch 22300, loss[loss=0.2348, simple_loss=0.302, pruned_loss=0.08378, over 21296.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3044, pruned_loss=0.07708, over 4275970.90 frames. ], batch size: 143, lr: 4.23e-03, grad_scale: 16.0 2023-06-25 01:59:44,735 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1231662.0, ans=0.125 2023-06-25 02:00:13,161 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.82 vs. limit=15.0 2023-06-25 02:00:31,233 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1231782.0, ans=0.0 2023-06-25 02:00:43,057 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1231782.0, ans=0.125 2023-06-25 02:00:57,384 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1231842.0, ans=0.1 2023-06-25 02:01:00,005 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.515e+02 3.143e+02 3.997e+02 5.587e+02 8.969e+02, threshold=7.995e+02, percent-clipped=6.0 2023-06-25 02:01:10,891 INFO [train.py:996] (0/4) Epoch 7, batch 22350, loss[loss=0.2331, simple_loss=0.3002, pruned_loss=0.083, over 21762.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3028, pruned_loss=0.07754, over 4284790.06 frames. 
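The lr column drifts slowly downward across this span (4.26e-03 near batch 19450 to 4.23e-03 by batch 22300). A sketch assuming an Eden-style schedule driven by the base_lr, lr_batches and lr_epochs values in the hyper-parameter dump; both the exact formula and the epoch value fed to it are assumptions.

# --- illustrative sketch ---
def eden_like_lr(base_lr, batch, epoch, lr_batches=7500.0, lr_epochs=1.5):
    batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * batch_factor * epoch_factor

# Batch index inferred from checkpoint-204000; the epoch value passed in is assumed.
print(eden_like_lr(base_lr=0.045, batch=204000.0, epoch=6.0))  # ~4.2e-3, same range as the lr column
# --- end sketch ---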
], batch size: 112, lr: 4.23e-03, grad_scale: 16.0 2023-06-25 02:01:20,224 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1231902.0, ans=0.125 2023-06-25 02:01:20,882 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.17 vs. limit=15.0 2023-06-25 02:01:25,658 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1231902.0, ans=0.2 2023-06-25 02:01:26,287 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.95 vs. limit=15.0 2023-06-25 02:01:32,572 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1231962.0, ans=0.2 2023-06-25 02:01:32,660 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1231962.0, ans=0.0 2023-06-25 02:01:34,105 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1231962.0, ans=0.125 2023-06-25 02:01:36,994 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.87 vs. limit=12.0 2023-06-25 02:01:37,008 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.00 vs. limit=22.5 2023-06-25 02:01:38,232 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1231962.0, ans=0.125 2023-06-25 02:01:40,204 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.77 vs. limit=15.0 2023-06-25 02:01:47,432 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.71 vs. limit=10.0 2023-06-25 02:01:52,115 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 02:02:28,799 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1232082.0, ans=0.2 2023-06-25 02:02:29,399 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.80 vs. limit=15.0 2023-06-25 02:02:35,268 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1232142.0, ans=0.0 2023-06-25 02:02:50,729 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1232142.0, ans=0.2 2023-06-25 02:02:50,735 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_na.min_abs, batch_count=1232142.0, ans=0.02 2023-06-25 02:02:59,814 INFO [train.py:996] (0/4) Epoch 7, batch 22400, loss[loss=0.2079, simple_loss=0.2797, pruned_loss=0.06804, over 21892.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3014, pruned_loss=0.07568, over 4282409.52 frames. 
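The `Whitening: name=..., num_groups=..., num_channels=..., metric=X vs. limit=Y` entries compare a statistic of the activation covariance against a limit: the statistic is close to 1 when the per-group covariance is near a multiple of the identity and grows when a few directions dominate. The function below is a rough illustration of such a metric only; it is not the exact icefall formula.

```python
# Rough illustration of a "whitening" metric: for each channel group, measure
# how far the empirical covariance is from a scaled identity.  Equals 1.0 for
# decorrelated, equal-variance channels; grows as a few directions dominate.
import torch


def whitening_metric(x: torch.Tensor, num_groups: int) -> float:
    # x: (num_frames, num_channels); channels split into num_groups groups.
    num_frames, num_channels = x.shape
    assert num_channels % num_groups == 0
    group = num_channels // num_groups
    x = x.reshape(num_frames, num_groups, group).permute(1, 2, 0)  # (g, c, t)
    x = x - x.mean(dim=2, keepdim=True)
    cov = torch.matmul(x, x.transpose(1, 2)) / num_frames           # (g, c, c)
    # mean(C^2) / mean(diag(C))^2 * c  ==  1.0 when C is a multiple of I.
    metric = (cov.pow(2).mean(dim=(1, 2))
              / cov.diagonal(dim1=1, dim2=2).mean(dim=1).pow(2)) * group
    return metric.mean().item()


limit = 15.0
x = torch.randn(1000, 256)          # already white: metric stays near 1
print(whitening_metric(x, num_groups=1), "vs. limit", limit)
x_bad = x[:, :1].repeat(1, 256)     # one direction dominates: metric is large
print(whitening_metric(x_bad, num_groups=1), "vs. limit", limit)
```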
], batch size: 107, lr: 4.23e-03, grad_scale: 32.0 2023-06-25 02:04:42,735 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.236e+02 2.737e+02 3.177e+02 4.252e+02 6.969e+02, threshold=6.354e+02, percent-clipped=0.0 2023-06-25 02:04:48,088 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.00 vs. limit=15.0 2023-06-25 02:04:48,420 INFO [train.py:996] (0/4) Epoch 7, batch 22450, loss[loss=0.1957, simple_loss=0.2528, pruned_loss=0.06933, over 21497.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.2943, pruned_loss=0.07425, over 4272615.48 frames. ], batch size: 441, lr: 4.23e-03, grad_scale: 32.0 2023-06-25 02:05:17,662 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.92 vs. limit=15.0 2023-06-25 02:05:47,265 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1232622.0, ans=0.2 2023-06-25 02:06:26,666 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1232742.0, ans=0.125 2023-06-25 02:06:30,721 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.34 vs. limit=15.0 2023-06-25 02:06:36,030 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.80 vs. limit=15.0 2023-06-25 02:06:43,944 INFO [train.py:996] (0/4) Epoch 7, batch 22500, loss[loss=0.2207, simple_loss=0.2802, pruned_loss=0.08058, over 21521.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2886, pruned_loss=0.0739, over 4270666.79 frames. ], batch size: 391, lr: 4.23e-03, grad_scale: 32.0 2023-06-25 02:06:46,183 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1232802.0, ans=0.0 2023-06-25 02:07:06,021 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.93 vs. limit=15.0 2023-06-25 02:07:15,536 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1232862.0, ans=0.025 2023-06-25 02:08:17,655 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 02:08:22,740 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.325e+02 2.995e+02 3.831e+02 4.510e+02 7.998e+02, threshold=7.663e+02, percent-clipped=9.0 2023-06-25 02:08:30,492 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1233042.0, ans=0.0 2023-06-25 02:08:32,958 INFO [train.py:996] (0/4) Epoch 7, batch 22550, loss[loss=0.2084, simple_loss=0.2872, pruned_loss=0.06477, over 21822.00 frames. ], tot_loss[loss=0.221, simple_loss=0.2938, pruned_loss=0.07408, over 4277357.32 frames. 
], batch size: 298, lr: 4.23e-03, grad_scale: 32.0 2023-06-25 02:08:37,207 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1233102.0, ans=0.125 2023-06-25 02:09:28,218 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1233222.0, ans=0.125 2023-06-25 02:10:25,383 INFO [train.py:996] (0/4) Epoch 7, batch 22600, loss[loss=0.1921, simple_loss=0.2472, pruned_loss=0.06847, over 20404.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2948, pruned_loss=0.07417, over 4271621.06 frames. ], batch size: 703, lr: 4.23e-03, grad_scale: 16.0 2023-06-25 02:10:26,115 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1233402.0, ans=0.1 2023-06-25 02:10:43,930 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1233402.0, ans=0.2 2023-06-25 02:11:27,839 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.96 vs. limit=15.0 2023-06-25 02:11:29,378 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.90 vs. limit=15.0 2023-06-25 02:11:41,492 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1233582.0, ans=0.1 2023-06-25 02:11:50,343 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 02:12:10,429 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.462e+02 3.222e+02 3.850e+02 5.288e+02 1.031e+03, threshold=7.700e+02, percent-clipped=4.0 2023-06-25 02:12:14,441 INFO [train.py:996] (0/4) Epoch 7, batch 22650, loss[loss=0.2067, simple_loss=0.2711, pruned_loss=0.07114, over 21668.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2912, pruned_loss=0.07367, over 4272775.80 frames. ], batch size: 333, lr: 4.23e-03, grad_scale: 16.0 2023-06-25 02:13:47,289 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.58 vs. limit=8.0 2023-06-25 02:13:51,564 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1233942.0, ans=0.1 2023-06-25 02:14:01,938 INFO [train.py:996] (0/4) Epoch 7, batch 22700, loss[loss=0.1907, simple_loss=0.2578, pruned_loss=0.06179, over 21723.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2867, pruned_loss=0.0729, over 4253378.47 frames. ], batch size: 124, lr: 4.23e-03, grad_scale: 16.0 2023-06-25 02:14:19,598 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1234002.0, ans=0.125 2023-06-25 02:14:26,024 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 02:14:31,323 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1234062.0, ans=0.125 2023-06-25 02:15:03,653 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.83 vs. 
limit=22.5 2023-06-25 02:15:31,195 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1234242.0, ans=0.125 2023-06-25 02:15:45,756 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1234242.0, ans=0.125 2023-06-25 02:15:46,862 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.360e+02 3.300e+02 4.052e+02 5.642e+02 1.079e+03, threshold=8.104e+02, percent-clipped=7.0 2023-06-25 02:15:49,902 INFO [train.py:996] (0/4) Epoch 7, batch 22750, loss[loss=0.2579, simple_loss=0.3266, pruned_loss=0.09462, over 21595.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2888, pruned_loss=0.07398, over 4263268.11 frames. ], batch size: 414, lr: 4.23e-03, grad_scale: 16.0 2023-06-25 02:16:09,043 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1234302.0, ans=0.0 2023-06-25 02:16:54,928 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.36 vs. limit=10.0 2023-06-25 02:17:15,379 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1234482.0, ans=0.0 2023-06-25 02:17:20,603 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1234542.0, ans=0.125 2023-06-25 02:17:36,766 INFO [train.py:996] (0/4) Epoch 7, batch 22800, loss[loss=0.2079, simple_loss=0.2732, pruned_loss=0.07137, over 21508.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.2922, pruned_loss=0.07608, over 4275611.14 frames. ], batch size: 548, lr: 4.23e-03, grad_scale: 32.0 2023-06-25 02:18:04,812 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1234662.0, ans=0.0 2023-06-25 02:18:17,259 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1234662.0, ans=0.0 2023-06-25 02:18:18,006 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.41 vs. limit=15.0 2023-06-25 02:18:18,843 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1234662.0, ans=0.125 2023-06-25 02:18:22,432 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1234722.0, ans=0.125 2023-06-25 02:18:39,718 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1234722.0, ans=0.1 2023-06-25 02:19:23,107 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.424e+02 3.142e+02 3.789e+02 4.718e+02 7.259e+02, threshold=7.578e+02, percent-clipped=0.0 2023-06-25 02:19:24,724 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.70 vs. limit=15.0 2023-06-25 02:19:25,135 INFO [train.py:996] (0/4) Epoch 7, batch 22850, loss[loss=0.1731, simple_loss=0.2515, pruned_loss=0.04738, over 19964.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2884, pruned_loss=0.07531, over 4282869.51 frames. 
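Each batch line reports `loss`, `simple_loss` and `pruned_loss` together. In the pruned-transducer recipes the reported `loss` is a weighted combination of a cheap "simple" transducer loss (used to pick the pruning bounds) and the full pruned loss. With a weight of 0.5 on the simple term the numbers in these lines are reproduced, e.g. for the batch 22750 running totals above: 0.5 · 0.2888 + 0.07398 ≈ 0.2184. The snippet below shows only that combination; the warm-up behaviour of the real recipe is omitted.

```python
# Sketch of how the reported "loss" combines the two terms in every batch line.
# simple_loss / pruned_loss would come from k2's rnnt_loss_simple /
# rnnt_loss_pruned; here they are stand-in tensors taken from the log above.
import torch

simple_loss_scale = 0.5                 # weight on the cheap "simple" term
simple_loss = torch.tensor(0.2888)      # values from the batch 22750 tot_loss line
pruned_loss = torch.tensor(0.07398)

loss = simple_loss_scale * simple_loss + pruned_loss
print(f"loss={loss:.4f}")               # -> loss=0.2184, matching the log
```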
], batch size: 704, lr: 4.23e-03, grad_scale: 16.0 2023-06-25 02:19:57,033 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1234962.0, ans=0.035 2023-06-25 02:20:03,575 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1235022.0, ans=0.0 2023-06-25 02:20:21,964 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1235022.0, ans=0.125 2023-06-25 02:20:46,672 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1235082.0, ans=0.125 2023-06-25 02:20:50,736 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1235142.0, ans=0.125 2023-06-25 02:21:09,530 INFO [train.py:996] (0/4) Epoch 7, batch 22900, loss[loss=0.2155, simple_loss=0.2968, pruned_loss=0.06712, over 21196.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2906, pruned_loss=0.07516, over 4282993.42 frames. ], batch size: 159, lr: 4.23e-03, grad_scale: 16.0 2023-06-25 02:21:13,365 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1235202.0, ans=0.1 2023-06-25 02:21:38,948 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1235262.0, ans=0.125 2023-06-25 02:22:04,520 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1235322.0, ans=0.125 2023-06-25 02:22:33,504 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1235382.0, ans=0.0 2023-06-25 02:22:44,424 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.62 vs. limit=15.0 2023-06-25 02:22:47,649 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1235442.0, ans=0.1 2023-06-25 02:23:04,028 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.336e+02 3.455e+02 4.744e+02 6.371e+02 1.430e+03, threshold=9.487e+02, percent-clipped=13.0 2023-06-25 02:23:05,583 INFO [train.py:996] (0/4) Epoch 7, batch 22950, loss[loss=0.2326, simple_loss=0.3555, pruned_loss=0.05484, over 21242.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.301, pruned_loss=0.07335, over 4278672.99 frames. ], batch size: 548, lr: 4.23e-03, grad_scale: 16.0 2023-06-25 02:23:42,548 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1235562.0, ans=0.125 2023-06-25 02:24:07,457 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.21 vs. limit=6.0 2023-06-25 02:24:11,215 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.82 vs. limit=10.0 2023-06-25 02:24:39,735 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1235742.0, ans=0.2 2023-06-25 02:24:53,116 INFO [train.py:996] (0/4) Epoch 7, batch 23000, loss[loss=0.2219, simple_loss=0.289, pruned_loss=0.07738, over 21565.00 frames. 
], tot_loss[loss=0.2227, simple_loss=0.3017, pruned_loss=0.07185, over 4277696.88 frames. ], batch size: 548, lr: 4.23e-03, grad_scale: 16.0 2023-06-25 02:25:34,017 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1235922.0, ans=0.0 2023-06-25 02:26:00,667 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1235982.0, ans=0.0 2023-06-25 02:26:06,233 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 02:26:31,085 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1236042.0, ans=0.2 2023-06-25 02:26:40,408 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.181e+02 3.049e+02 3.858e+02 4.759e+02 9.781e+02, threshold=7.716e+02, percent-clipped=2.0 2023-06-25 02:26:42,794 INFO [train.py:996] (0/4) Epoch 7, batch 23050, loss[loss=0.2969, simple_loss=0.3527, pruned_loss=0.1206, over 21357.00 frames. ], tot_loss[loss=0.226, simple_loss=0.3042, pruned_loss=0.07392, over 4284852.07 frames. ], batch size: 507, lr: 4.23e-03, grad_scale: 16.0 2023-06-25 02:26:50,187 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1236102.0, ans=0.125 2023-06-25 02:27:01,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1236102.0, ans=0.125 2023-06-25 02:28:31,615 INFO [train.py:996] (0/4) Epoch 7, batch 23100, loss[loss=0.1896, simple_loss=0.2491, pruned_loss=0.06505, over 21205.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.3004, pruned_loss=0.074, over 4267632.63 frames. ], batch size: 176, lr: 4.23e-03, grad_scale: 16.0 2023-06-25 02:28:47,084 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1236402.0, ans=0.125 2023-06-25 02:28:52,044 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1236402.0, ans=0.125 2023-06-25 02:28:55,153 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1236462.0, ans=0.125 2023-06-25 02:28:57,673 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.90 vs. limit=15.0 2023-06-25 02:30:16,717 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.225e+02 3.056e+02 3.591e+02 4.604e+02 9.748e+02, threshold=7.182e+02, percent-clipped=1.0 2023-06-25 02:30:18,329 INFO [train.py:996] (0/4) Epoch 7, batch 23150, loss[loss=0.2318, simple_loss=0.2923, pruned_loss=0.08566, over 21754.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.2951, pruned_loss=0.07313, over 4265828.96 frames. ], batch size: 441, lr: 4.23e-03, grad_scale: 16.0 2023-06-25 02:30:22,365 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.63 vs. 
limit=15.0 2023-06-25 02:30:26,843 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1236702.0, ans=0.1 2023-06-25 02:30:34,110 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.47 vs. limit=22.5 2023-06-25 02:31:27,355 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1236882.0, ans=0.125 2023-06-25 02:31:37,562 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1236942.0, ans=0.1 2023-06-25 02:31:44,332 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1236942.0, ans=0.0 2023-06-25 02:31:51,032 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1236942.0, ans=0.1 2023-06-25 02:32:03,906 INFO [train.py:996] (0/4) Epoch 7, batch 23200, loss[loss=0.2131, simple_loss=0.2786, pruned_loss=0.07384, over 21488.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2939, pruned_loss=0.07349, over 4271238.61 frames. ], batch size: 194, lr: 4.23e-03, grad_scale: 32.0 2023-06-25 02:33:32,546 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1237242.0, ans=0.2 2023-06-25 02:33:34,188 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1237242.0, ans=0.05 2023-06-25 02:33:35,872 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1237242.0, ans=0.125 2023-06-25 02:33:52,459 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.409e+02 3.126e+02 3.728e+02 5.060e+02 1.069e+03, threshold=7.456e+02, percent-clipped=4.0 2023-06-25 02:33:52,493 INFO [train.py:996] (0/4) Epoch 7, batch 23250, loss[loss=0.227, simple_loss=0.2975, pruned_loss=0.0782, over 21511.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.294, pruned_loss=0.07512, over 4285806.25 frames. ], batch size: 131, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 02:34:34,858 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 02:34:35,479 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.97 vs. limit=6.0 2023-06-25 02:35:38,897 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1237542.0, ans=0.2 2023-06-25 02:35:43,502 INFO [train.py:996] (0/4) Epoch 7, batch 23300, loss[loss=0.2483, simple_loss=0.3616, pruned_loss=0.06749, over 21815.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3016, pruned_loss=0.07689, over 4281949.01 frames. 
], batch size: 316, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 02:36:33,115 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1237722.0, ans=0.0 2023-06-25 02:37:23,043 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1237842.0, ans=0.125 2023-06-25 02:37:39,006 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.304e+02 3.209e+02 3.833e+02 5.523e+02 1.342e+03, threshold=7.666e+02, percent-clipped=15.0 2023-06-25 02:37:39,038 INFO [train.py:996] (0/4) Epoch 7, batch 23350, loss[loss=0.2291, simple_loss=0.3192, pruned_loss=0.06949, over 21711.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.3038, pruned_loss=0.07565, over 4283087.43 frames. ], batch size: 332, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 02:37:55,677 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1237902.0, ans=0.0 2023-06-25 02:38:11,789 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.40 vs. limit=6.0 2023-06-25 02:38:56,079 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1238082.0, ans=0.125 2023-06-25 02:38:59,285 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1238082.0, ans=0.2 2023-06-25 02:39:33,718 INFO [train.py:996] (0/4) Epoch 7, batch 23400, loss[loss=0.203, simple_loss=0.2766, pruned_loss=0.06469, over 21648.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2953, pruned_loss=0.07166, over 4283546.21 frames. ], batch size: 263, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 02:40:58,760 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1238442.0, ans=0.125 2023-06-25 02:41:13,704 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.03 vs. limit=15.0 2023-06-25 02:41:23,140 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.927e+02 3.153e+02 4.336e+02 5.410e+02 1.099e+03, threshold=8.672e+02, percent-clipped=12.0 2023-06-25 02:41:23,178 INFO [train.py:996] (0/4) Epoch 7, batch 23450, loss[loss=0.165, simple_loss=0.2679, pruned_loss=0.03109, over 20758.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.297, pruned_loss=0.07323, over 4280752.88 frames. ], batch size: 607, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 02:41:25,372 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1238502.0, ans=0.0 2023-06-25 02:42:03,336 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1238622.0, ans=0.04949747468305833 2023-06-25 02:42:04,868 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1238622.0, ans=0.04949747468305833 2023-06-25 02:42:29,081 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1238682.0, ans=0.015 2023-06-25 02:43:06,181 INFO [train.py:996] (0/4) Epoch 7, batch 23500, loss[loss=0.2201, simple_loss=0.2911, pruned_loss=0.07453, over 21882.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.2974, pruned_loss=0.07473, over 4291075.91 frames. 
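The per-batch figures are reported "over N frames" and the running `tot_loss` "over M frames": losses are accumulated as frame-weighted sums and only divided by the frame count when printed. A small sketch of such a tracker follows; the real tracker in icefall also decays old statistics so the average follows recent batches, which is omitted here.

```python
# Minimal frame-weighted loss tracker in the spirit of the "over N frames"
# reports: accumulate (loss * frames) and frames, divide only when printing.
class FrameWeightedLoss:
    def __init__(self):
        self.loss_sum = 0.0
        self.frames = 0.0

    def update(self, loss: float, num_frames: float) -> None:
        self.loss_sum += loss * num_frames
        self.frames += num_frames

    def average(self) -> float:
        return self.loss_sum / max(self.frames, 1.0)


tot = FrameWeightedLoss()
tot.update(loss=0.203, num_frames=21648.0)   # the batch 23400 utterance above
tot.update(loss=0.2217, num_frames=21000.0)  # a further (made-up) batch
print(f"tot_loss[loss={tot.average():.4f}, over {tot.frames:.2f} frames.]")
```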
], batch size: 351, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 02:44:42,026 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1239042.0, ans=0.0 2023-06-25 02:44:51,120 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1239042.0, ans=0.125 2023-06-25 02:44:53,879 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.391e+02 2.970e+02 3.465e+02 4.227e+02 7.885e+02, threshold=6.930e+02, percent-clipped=0.0 2023-06-25 02:44:53,929 INFO [train.py:996] (0/4) Epoch 7, batch 23550, loss[loss=0.2019, simple_loss=0.2648, pruned_loss=0.06949, over 21890.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.2929, pruned_loss=0.07431, over 4284910.08 frames. ], batch size: 373, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 02:46:24,103 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1239342.0, ans=0.125 2023-06-25 02:46:27,380 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1239342.0, ans=0.125 2023-06-25 02:46:29,101 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1239342.0, ans=0.125 2023-06-25 02:46:33,549 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.37 vs. limit=15.0 2023-06-25 02:46:41,619 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1239402.0, ans=0.125 2023-06-25 02:46:42,572 INFO [train.py:996] (0/4) Epoch 7, batch 23600, loss[loss=0.2509, simple_loss=0.3254, pruned_loss=0.08824, over 21942.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2936, pruned_loss=0.07516, over 4276499.04 frames. ], batch size: 372, lr: 4.22e-03, grad_scale: 32.0 2023-06-25 02:47:12,584 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1239462.0, ans=0.0 2023-06-25 02:47:20,942 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.48 vs. limit=15.0 2023-06-25 02:47:21,639 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1239462.0, ans=0.125 2023-06-25 02:47:34,270 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1239522.0, ans=0.0 2023-06-25 02:47:41,791 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.91 vs. 
limit=15.0 2023-06-25 02:48:07,846 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1239582.0, ans=0.125 2023-06-25 02:48:13,227 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1239642.0, ans=0.1 2023-06-25 02:48:18,764 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=1239642.0, ans=15.0 2023-06-25 02:48:28,070 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.331e+02 3.161e+02 4.117e+02 5.105e+02 1.053e+03, threshold=8.234e+02, percent-clipped=8.0 2023-06-25 02:48:28,118 INFO [train.py:996] (0/4) Epoch 7, batch 23650, loss[loss=0.2339, simple_loss=0.3104, pruned_loss=0.07864, over 21445.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.2941, pruned_loss=0.0736, over 4277413.46 frames. ], batch size: 194, lr: 4.22e-03, grad_scale: 32.0 2023-06-25 02:48:41,187 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1239702.0, ans=0.125 2023-06-25 02:48:57,353 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1239762.0, ans=0.125 2023-06-25 02:50:17,021 INFO [train.py:996] (0/4) Epoch 7, batch 23700, loss[loss=0.1971, simple_loss=0.2777, pruned_loss=0.05819, over 21418.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2956, pruned_loss=0.07285, over 4272994.64 frames. ], batch size: 194, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 02:51:01,069 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1240062.0, ans=0.0 2023-06-25 02:51:20,792 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1240122.0, ans=0.0 2023-06-25 02:51:44,610 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1240242.0, ans=0.125 2023-06-25 02:52:12,623 INFO [train.py:996] (0/4) Epoch 7, batch 23750, loss[loss=0.155, simple_loss=0.2441, pruned_loss=0.03292, over 21683.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2973, pruned_loss=0.073, over 4272319.85 frames. ], batch size: 230, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 02:52:14,424 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.255e+02 3.374e+02 3.894e+02 5.027e+02 8.477e+02, threshold=7.788e+02, percent-clipped=1.0 2023-06-25 02:52:51,089 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1240362.0, ans=0.0 2023-06-25 02:52:51,806 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.61 vs. 
limit=15.0 2023-06-25 02:52:54,483 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1240362.0, ans=0.1 2023-06-25 02:53:01,732 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1240422.0, ans=0.95 2023-06-25 02:53:19,671 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1240482.0, ans=0.125 2023-06-25 02:53:25,565 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=1240482.0, ans=15.0 2023-06-25 02:54:03,189 INFO [train.py:996] (0/4) Epoch 7, batch 23800, loss[loss=0.2682, simple_loss=0.3627, pruned_loss=0.0869, over 21772.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.2987, pruned_loss=0.07201, over 4269691.87 frames. ], batch size: 371, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 02:54:15,227 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1240602.0, ans=0.1 2023-06-25 02:54:35,081 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1240662.0, ans=0.125 2023-06-25 02:54:51,835 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1240722.0, ans=0.125 2023-06-25 02:56:06,043 INFO [train.py:996] (0/4) Epoch 7, batch 23850, loss[loss=0.2483, simple_loss=0.3204, pruned_loss=0.08817, over 21275.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3065, pruned_loss=0.07368, over 4269508.57 frames. ], batch size: 159, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 02:56:07,971 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.236e+02 3.127e+02 4.092e+02 4.859e+02 9.689e+02, threshold=8.184e+02, percent-clipped=5.0 2023-06-25 02:56:29,947 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1240962.0, ans=0.125 2023-06-25 02:56:52,052 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1241022.0, ans=0.0 2023-06-25 02:57:42,255 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1241142.0, ans=0.0 2023-06-25 02:57:47,694 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.58 vs. limit=15.0 2023-06-25 02:57:55,515 INFO [train.py:996] (0/4) Epoch 7, batch 23900, loss[loss=0.2024, simple_loss=0.2849, pruned_loss=0.05992, over 21624.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3126, pruned_loss=0.0757, over 4268623.41 frames. ], batch size: 298, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 02:58:06,959 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.75 vs. 
limit=22.5 2023-06-25 02:58:08,002 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1241202.0, ans=0.125 2023-06-25 02:58:12,955 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1241262.0, ans=0.125 2023-06-25 02:58:24,801 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1241262.0, ans=0.2 2023-06-25 02:59:01,259 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1241382.0, ans=0.0 2023-06-25 02:59:32,135 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1241442.0, ans=0.125 2023-06-25 02:59:38,295 INFO [train.py:996] (0/4) Epoch 7, batch 23950, loss[loss=0.2109, simple_loss=0.2715, pruned_loss=0.0751, over 21849.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3076, pruned_loss=0.07561, over 4258321.99 frames. ], batch size: 107, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 02:59:39,938 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.625e+02 3.372e+02 4.562e+02 5.557e+02 1.074e+03, threshold=9.124e+02, percent-clipped=7.0 2023-06-25 02:59:53,046 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1241502.0, ans=0.125 2023-06-25 03:01:27,387 INFO [train.py:996] (0/4) Epoch 7, batch 24000, loss[loss=0.2364, simple_loss=0.3133, pruned_loss=0.07976, over 21773.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3088, pruned_loss=0.07826, over 4260488.45 frames. ], batch size: 247, lr: 4.22e-03, grad_scale: 32.0 2023-06-25 03:01:27,388 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-25 03:01:45,552 INFO [train.py:1028] (0/4) Epoch 7, validation: loss=0.2668, simple_loss=0.3629, pruned_loss=0.0854, over 1796401.00 frames. 2023-06-25 03:01:45,553 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23714MB 2023-06-25 03:02:29,819 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1241862.0, ans=0.2 2023-06-25 03:02:35,388 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.71 vs. limit=12.0 2023-06-25 03:03:11,975 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1241982.0, ans=0.2 2023-06-25 03:03:14,120 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.27 vs. limit=10.0 2023-06-25 03:03:17,141 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1242042.0, ans=0.2 2023-06-25 03:03:20,780 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1242042.0, ans=0.125 2023-06-25 03:03:35,969 INFO [train.py:996] (0/4) Epoch 7, batch 24050, loss[loss=0.2338, simple_loss=0.3021, pruned_loss=0.08273, over 20042.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3097, pruned_loss=0.07841, over 4264352.80 frames. 
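The pair of entries "Computing validation loss" followed by "Epoch 7, validation: loss=..." appears whenever the batch counter hits the validation interval: training pauses, the dev loader is swept with gradients disabled, and the peak GPU memory is reported. A schematic of that hook is below; `model`, `valid_dl`, `compute_loss` and `valid_interval` are assumed stand-in names, not the recipe's exact API.

```python
# Schematic of the periodic validation hook implied by the
# "Computing validation loss" / "validation: loss=..." entry pairs.
import logging
import torch


def maybe_validate(model, valid_dl, compute_loss,
                   batch_idx_train: int, valid_interval: int = 3000) -> None:
    if batch_idx_train % valid_interval != 0:
        return
    logging.info("Computing validation loss")
    model.eval()
    loss_sum, frames = 0.0, 0.0
    with torch.no_grad():
        for batch in valid_dl:
            loss, num_frames = compute_loss(model, batch)
            loss_sum += loss.item() * num_frames
            frames += num_frames
    model.train()
    avg = loss_sum / max(frames, 1.0)
    logging.info(f"validation: loss={avg:.4f}, over {frames:.2f} frames.")
    if torch.cuda.is_available():
        mem_mb = torch.cuda.max_memory_allocated() // (1024 ** 2)
        logging.info(f"Maximum memory allocated so far is {mem_mb}MB")
```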
], batch size: 703, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 03:03:39,409 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.322e+02 3.516e+02 4.440e+02 5.748e+02 1.093e+03, threshold=8.881e+02, percent-clipped=2.0 2023-06-25 03:03:39,961 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1242102.0, ans=0.0 2023-06-25 03:03:55,646 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1242162.0, ans=0.125 2023-06-25 03:04:44,197 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1242282.0, ans=0.0 2023-06-25 03:04:47,900 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1242282.0, ans=0.125 2023-06-25 03:05:20,286 INFO [train.py:996] (0/4) Epoch 7, batch 24100, loss[loss=0.2901, simple_loss=0.368, pruned_loss=0.106, over 21742.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3108, pruned_loss=0.07777, over 4263912.69 frames. ], batch size: 441, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 03:05:22,537 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1242402.0, ans=0.0 2023-06-25 03:05:35,031 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1242402.0, ans=0.125 2023-06-25 03:05:48,888 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=1242462.0, ans=0.05 2023-06-25 03:06:05,730 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1242522.0, ans=0.0 2023-06-25 03:06:42,995 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1242582.0, ans=0.1 2023-06-25 03:06:57,725 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1242642.0, ans=0.0 2023-06-25 03:07:09,427 INFO [train.py:996] (0/4) Epoch 7, batch 24150, loss[loss=0.2203, simple_loss=0.2924, pruned_loss=0.0741, over 21856.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3112, pruned_loss=0.07966, over 4273304.36 frames. ], batch size: 371, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 03:07:12,836 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.342e+02 3.235e+02 4.030e+02 4.867e+02 1.048e+03, threshold=8.060e+02, percent-clipped=3.0 2023-06-25 03:07:17,439 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1242702.0, ans=0.125 2023-06-25 03:07:42,815 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1242762.0, ans=0.125 2023-06-25 03:08:37,827 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1242942.0, ans=0.125 2023-06-25 03:08:53,061 INFO [train.py:996] (0/4) Epoch 7, batch 24200, loss[loss=0.2541, simple_loss=0.3351, pruned_loss=0.08659, over 21714.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3137, pruned_loss=0.08083, over 4273516.03 frames. 
], batch size: 351, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 03:09:49,984 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1243122.0, ans=0.07 2023-06-25 03:09:51,707 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1243122.0, ans=0.2 2023-06-25 03:09:53,859 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1243122.0, ans=0.2 2023-06-25 03:10:09,881 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1243182.0, ans=0.04949747468305833 2023-06-25 03:10:48,485 INFO [train.py:996] (0/4) Epoch 7, batch 24250, loss[loss=0.1756, simple_loss=0.2655, pruned_loss=0.04283, over 21373.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3098, pruned_loss=0.07438, over 4275406.25 frames. ], batch size: 194, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:10:51,920 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.960e+02 3.061e+02 3.870e+02 4.839e+02 8.744e+02, threshold=7.741e+02, percent-clipped=3.0 2023-06-25 03:12:19,402 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1243542.0, ans=10.0 2023-06-25 03:12:31,269 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1243542.0, ans=0.125 2023-06-25 03:12:36,591 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1243602.0, ans=0.0 2023-06-25 03:12:38,085 INFO [train.py:996] (0/4) Epoch 7, batch 24300, loss[loss=0.1796, simple_loss=0.2823, pruned_loss=0.03841, over 21233.00 frames. ], tot_loss[loss=0.219, simple_loss=0.3013, pruned_loss=0.06835, over 4281320.80 frames. ], batch size: 548, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:12:39,334 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.81 vs. limit=15.0 2023-06-25 03:13:13,709 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1243662.0, ans=0.035 2023-06-25 03:13:36,137 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1243722.0, ans=0.0 2023-06-25 03:13:57,191 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1243782.0, ans=0.0 2023-06-25 03:14:26,073 INFO [train.py:996] (0/4) Epoch 7, batch 24350, loss[loss=0.2056, simple_loss=0.2792, pruned_loss=0.06601, over 21263.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2984, pruned_loss=0.06882, over 4290729.60 frames. ], batch size: 143, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:14:34,782 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.057e+02 2.804e+02 3.474e+02 4.596e+02 8.821e+02, threshold=6.948e+02, percent-clipped=1.0 2023-06-25 03:15:24,066 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.94 vs. 
limit=15.0 2023-06-25 03:15:41,276 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1244082.0, ans=0.1 2023-06-25 03:15:58,191 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1244142.0, ans=0.5 2023-06-25 03:16:20,443 INFO [train.py:996] (0/4) Epoch 7, batch 24400, loss[loss=0.2139, simple_loss=0.2976, pruned_loss=0.06508, over 21282.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.3022, pruned_loss=0.07204, over 4286964.04 frames. ], batch size: 159, lr: 4.21e-03, grad_scale: 32.0 2023-06-25 03:16:29,959 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1244202.0, ans=0.125 2023-06-25 03:16:35,975 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.88 vs. limit=5.0 2023-06-25 03:16:43,042 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1244262.0, ans=0.0 2023-06-25 03:16:49,360 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1244262.0, ans=0.0 2023-06-25 03:17:31,136 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.45 vs. limit=15.0 2023-06-25 03:17:56,441 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1244442.0, ans=0.1 2023-06-25 03:18:15,758 INFO [train.py:996] (0/4) Epoch 7, batch 24450, loss[loss=0.2549, simple_loss=0.3434, pruned_loss=0.08321, over 21766.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3042, pruned_loss=0.07319, over 4282609.41 frames. ], batch size: 351, lr: 4.21e-03, grad_scale: 32.0 2023-06-25 03:18:19,268 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.592e+02 3.443e+02 3.965e+02 5.571e+02 1.139e+03, threshold=7.931e+02, percent-clipped=16.0 2023-06-25 03:19:52,512 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.96 vs. limit=15.0 2023-06-25 03:20:03,071 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.01 vs. limit=22.5 2023-06-25 03:20:03,685 INFO [train.py:996] (0/4) Epoch 7, batch 24500, loss[loss=0.2392, simple_loss=0.3043, pruned_loss=0.0871, over 21791.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3047, pruned_loss=0.07322, over 4288024.20 frames. ], batch size: 441, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:20:13,426 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1244802.0, ans=0.125 2023-06-25 03:20:45,363 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 03:21:27,248 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.56 vs. limit=15.0 2023-06-25 03:21:48,775 INFO [train.py:996] (0/4) Epoch 7, batch 24550, loss[loss=0.2745, simple_loss=0.3538, pruned_loss=0.0976, over 21582.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3058, pruned_loss=0.0751, over 4284042.52 frames. 
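The learning rate in these lines drifts very slowly (4.24e-03 earlier in the epoch, 4.21e-03 here, 4.20e-03 a few thousand batches later) because it follows an Eden-style schedule that decays with both the batch and the epoch counters. The function below shows the shape of that family of schedules; the constants and step scaling of the actual recipe differ, so it does not reproduce the logged values exactly, it only explains why the rate barely moves between batch 22100 and batch 25250.

```python
# Approximate Eden-style learning-rate rule: decays as inverse fourth roots of
# the batch count and the (fractional) epoch count.  Illustrative constants.
def eden_lr(batch: float, epoch: float,
            base_lr: float = 0.045,
            lr_batches: float = 7500.0,
            lr_epochs: float = 1.5) -> float:
    batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * batch_factor * epoch_factor


# Deep into training the value changes only in the third significant digit
# between two points that are tens of thousands of batches apart:
for batch in (1_230_000, 1_250_000):
    print(batch, f"{eden_lr(batch=batch, epoch=7.5):.3e}")
```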
], batch size: 414, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:21:53,855 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.291e+02 2.970e+02 3.569e+02 4.682e+02 1.145e+03, threshold=7.139e+02, percent-clipped=2.0 2023-06-25 03:22:48,185 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1245222.0, ans=0.05 2023-06-25 03:23:08,027 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1245282.0, ans=0.1 2023-06-25 03:23:20,528 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1245342.0, ans=0.125 2023-06-25 03:23:25,645 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1245342.0, ans=0.2 2023-06-25 03:23:28,634 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1245342.0, ans=0.0 2023-06-25 03:23:31,368 INFO [train.py:996] (0/4) Epoch 7, batch 24600, loss[loss=0.219, simple_loss=0.2857, pruned_loss=0.07612, over 21783.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.3015, pruned_loss=0.07499, over 4283538.63 frames. ], batch size: 352, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:23:45,584 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1245402.0, ans=0.125 2023-06-25 03:24:26,966 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1245522.0, ans=0.125 2023-06-25 03:24:26,982 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 03:24:33,787 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1245582.0, ans=0.125 2023-06-25 03:25:14,534 INFO [train.py:996] (0/4) Epoch 7, batch 24650, loss[loss=0.2143, simple_loss=0.2683, pruned_loss=0.08019, over 21351.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.2945, pruned_loss=0.07422, over 4282054.67 frames. ], batch size: 473, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:25:19,712 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.334e+02 3.258e+02 3.830e+02 5.672e+02 1.406e+03, threshold=7.660e+02, percent-clipped=13.0 2023-06-25 03:25:45,152 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.29 vs. limit=15.0 2023-06-25 03:26:10,643 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1245822.0, ans=0.0 2023-06-25 03:26:25,104 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.01 vs. limit=22.5 2023-06-25 03:27:02,269 INFO [train.py:996] (0/4) Epoch 7, batch 24700, loss[loss=0.1923, simple_loss=0.271, pruned_loss=0.0568, over 16987.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2916, pruned_loss=0.07207, over 4267198.70 frames. ], batch size: 67, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:28:49,810 INFO [train.py:996] (0/4) Epoch 7, batch 24750, loss[loss=0.1847, simple_loss=0.2509, pruned_loss=0.05927, over 21480.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.2855, pruned_loss=0.06951, over 4275960.10 frames. 
], batch size: 132, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:28:54,682 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.316e+02 2.901e+02 3.279e+02 4.785e+02 1.213e+03, threshold=6.557e+02, percent-clipped=5.0 2023-06-25 03:29:37,809 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1246422.0, ans=0.0 2023-06-25 03:29:51,710 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1246422.0, ans=0.0 2023-06-25 03:30:26,548 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1246542.0, ans=0.0 2023-06-25 03:30:35,863 INFO [train.py:996] (0/4) Epoch 7, batch 24800, loss[loss=0.214, simple_loss=0.2926, pruned_loss=0.06767, over 21869.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.281, pruned_loss=0.06959, over 4276075.02 frames. ], batch size: 124, lr: 4.21e-03, grad_scale: 32.0 2023-06-25 03:30:53,518 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.36 vs. limit=15.0 2023-06-25 03:31:11,822 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1246662.0, ans=0.125 2023-06-25 03:31:45,847 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1246782.0, ans=0.125 2023-06-25 03:32:23,883 INFO [train.py:996] (0/4) Epoch 7, batch 24850, loss[loss=0.2033, simple_loss=0.2779, pruned_loss=0.06436, over 21717.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2821, pruned_loss=0.07081, over 4278659.12 frames. ], batch size: 247, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:32:30,860 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.467e+02 3.124e+02 3.906e+02 4.909e+02 9.613e+02, threshold=7.812e+02, percent-clipped=9.0 2023-06-25 03:33:21,851 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.75 vs. limit=22.5 2023-06-25 03:33:43,057 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1247082.0, ans=0.0 2023-06-25 03:34:05,600 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1247142.0, ans=0.1 2023-06-25 03:34:14,038 INFO [train.py:996] (0/4) Epoch 7, batch 24900, loss[loss=0.2305, simple_loss=0.3085, pruned_loss=0.07631, over 21545.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2854, pruned_loss=0.07213, over 4275913.44 frames. 
], batch size: 194, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:34:14,468 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1247202.0, ans=0.125 2023-06-25 03:34:16,253 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1247202.0, ans=0.1 2023-06-25 03:34:40,475 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1247262.0, ans=0.035 2023-06-25 03:34:49,891 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1247262.0, ans=0.125 2023-06-25 03:35:12,934 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1247322.0, ans=0.2 2023-06-25 03:36:08,376 INFO [train.py:996] (0/4) Epoch 7, batch 24950, loss[loss=0.2632, simple_loss=0.3382, pruned_loss=0.09407, over 21590.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.2918, pruned_loss=0.07547, over 4274923.47 frames. ], batch size: 389, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:36:15,231 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.748e+02 3.765e+02 4.804e+02 6.774e+02 1.687e+03, threshold=9.608e+02, percent-clipped=17.0 2023-06-25 03:36:31,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1247562.0, ans=0.125 2023-06-25 03:37:18,492 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1247682.0, ans=0.2 2023-06-25 03:37:23,331 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1247682.0, ans=0.0 2023-06-25 03:37:27,109 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1247682.0, ans=0.125 2023-06-25 03:37:37,596 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1247742.0, ans=0.125 2023-06-25 03:37:57,861 INFO [train.py:996] (0/4) Epoch 7, batch 25000, loss[loss=0.2067, simple_loss=0.2787, pruned_loss=0.0674, over 21756.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3002, pruned_loss=0.07769, over 4276388.72 frames. ], batch size: 112, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:38:14,694 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1247802.0, ans=0.2 2023-06-25 03:39:00,509 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1247922.0, ans=0.0 2023-06-25 03:39:10,698 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-208000.pt 2023-06-25 03:39:47,804 INFO [train.py:996] (0/4) Epoch 7, batch 25050, loss[loss=0.216, simple_loss=0.2717, pruned_loss=0.08018, over 21836.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.2934, pruned_loss=0.07589, over 4271507.30 frames. 
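The "Saving checkpoint to zipformer/exp_L_small/checkpoint-208000.pt" entry above comes from batch-count-based checkpointing: every fixed number of steps a `checkpoint-{batch_idx}.pt` file with model, optimizer and scheduler state is written to the experiment directory. A minimal sketch, with the interval and the saved fields as assumptions:

```python
# Minimal sketch of batch-count-based checkpointing that would produce names
# like "checkpoint-208000.pt".  save_every_n and the saved fields are assumed.
from pathlib import Path
import logging
import torch


def maybe_save_checkpoint(exp_dir: Path, batch_idx_train: int, model,
                          optimizer, scheduler, save_every_n: int = 4000) -> None:
    if batch_idx_train == 0 or batch_idx_train % save_every_n != 0:
        return
    path = exp_dir / f"checkpoint-{batch_idx_train}.pt"
    logging.info(f"Saving checkpoint to {path}")
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "scheduler": scheduler.state_dict(),
            "batch_idx_train": batch_idx_train,
        },
        path,
    )
```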
], batch size: 373, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:39:59,669 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.536e+02 3.278e+02 3.984e+02 5.261e+02 1.222e+03, threshold=7.967e+02, percent-clipped=1.0 2023-06-25 03:40:13,903 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1248162.0, ans=0.0 2023-06-25 03:40:41,270 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.91 vs. limit=15.0 2023-06-25 03:41:15,657 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1248342.0, ans=0.1 2023-06-25 03:41:35,613 INFO [train.py:996] (0/4) Epoch 7, batch 25100, loss[loss=0.1885, simple_loss=0.2628, pruned_loss=0.0571, over 21222.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2875, pruned_loss=0.07477, over 4272961.59 frames. ], batch size: 548, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:41:55,093 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1248402.0, ans=0.2 2023-06-25 03:42:47,694 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1248582.0, ans=0.2 2023-06-25 03:43:15,191 INFO [train.py:996] (0/4) Epoch 7, batch 25150, loss[loss=0.211, simple_loss=0.2835, pruned_loss=0.06925, over 17851.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.291, pruned_loss=0.07272, over 4272543.22 frames. ], batch size: 69, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:43:22,417 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.352e+02 2.917e+02 3.507e+02 4.290e+02 7.134e+02, threshold=7.014e+02, percent-clipped=0.0 2023-06-25 03:43:28,201 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1248702.0, ans=0.125 2023-06-25 03:44:03,915 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1248822.0, ans=0.1 2023-06-25 03:45:03,188 INFO [train.py:996] (0/4) Epoch 7, batch 25200, loss[loss=0.2146, simple_loss=0.3239, pruned_loss=0.05268, over 20859.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2905, pruned_loss=0.07093, over 4263338.28 frames. ], batch size: 608, lr: 4.21e-03, grad_scale: 32.0 2023-06-25 03:45:16,557 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.07 vs. limit=22.5 2023-06-25 03:45:29,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1249062.0, ans=0.1 2023-06-25 03:46:28,458 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.50 vs. limit=6.0 2023-06-25 03:46:43,570 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1249302.0, ans=0.125 2023-06-25 03:46:44,516 INFO [train.py:996] (0/4) Epoch 7, batch 25250, loss[loss=0.189, simple_loss=0.2736, pruned_loss=0.05223, over 21640.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.289, pruned_loss=0.0691, over 4274468.49 frames. 
], batch size: 263, lr: 4.20e-03, grad_scale: 32.0 2023-06-25 03:46:50,725 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.310e+02 3.493e+02 4.531e+02 6.299e+02 1.264e+03, threshold=9.062e+02, percent-clipped=19.0 2023-06-25 03:46:51,870 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1249302.0, ans=6.0 2023-06-25 03:46:53,084 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1249302.0, ans=0.0 2023-06-25 03:47:27,624 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1249422.0, ans=0.125 2023-06-25 03:48:04,598 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1249482.0, ans=0.0 2023-06-25 03:48:32,333 INFO [train.py:996] (0/4) Epoch 7, batch 25300, loss[loss=0.2319, simple_loss=0.311, pruned_loss=0.07642, over 21781.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2855, pruned_loss=0.06828, over 4262851.11 frames. ], batch size: 332, lr: 4.20e-03, grad_scale: 32.0 2023-06-25 03:49:09,169 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1249662.0, ans=0.125 2023-06-25 03:49:09,230 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1249662.0, ans=0.1 2023-06-25 03:49:31,652 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1249722.0, ans=0.0 2023-06-25 03:49:41,073 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1249782.0, ans=0.125 2023-06-25 03:50:14,488 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.92 vs. limit=22.5 2023-06-25 03:50:15,624 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1249842.0, ans=0.125 2023-06-25 03:50:18,322 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.38 vs. limit=22.5 2023-06-25 03:50:20,518 INFO [train.py:996] (0/4) Epoch 7, batch 25350, loss[loss=0.1553, simple_loss=0.2251, pruned_loss=0.04274, over 16784.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2866, pruned_loss=0.06754, over 4248654.67 frames. ], batch size: 61, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 03:50:29,462 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.308e+02 2.853e+02 3.365e+02 4.532e+02 7.857e+02, threshold=6.730e+02, percent-clipped=0.0 2023-06-25 03:50:39,652 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1249902.0, ans=0.125 2023-06-25 03:50:42,105 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.69 vs. limit=15.0 2023-06-25 03:51:20,024 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1250082.0, ans=0.2 2023-06-25 03:52:03,100 INFO [train.py:996] (0/4) Epoch 7, batch 25400, loss[loss=0.1681, simple_loss=0.2582, pruned_loss=0.03898, over 21619.00 frames. 
], tot_loss[loss=0.2089, simple_loss=0.2837, pruned_loss=0.06702, over 4241905.81 frames. ], batch size: 263, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 03:53:46,505 INFO [train.py:996] (0/4) Epoch 7, batch 25450, loss[loss=0.2017, simple_loss=0.2946, pruned_loss=0.05442, over 21320.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2841, pruned_loss=0.06892, over 4253502.96 frames. ], batch size: 159, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 03:53:55,092 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.119e+02 2.979e+02 3.775e+02 5.252e+02 7.977e+02, threshold=7.549e+02, percent-clipped=6.0 2023-06-25 03:54:25,430 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1250562.0, ans=0.1 2023-06-25 03:54:45,940 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1250682.0, ans=0.2 2023-06-25 03:55:32,125 INFO [train.py:996] (0/4) Epoch 7, batch 25500, loss[loss=0.164, simple_loss=0.2615, pruned_loss=0.03326, over 21632.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2835, pruned_loss=0.06554, over 4260606.20 frames. ], batch size: 263, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 03:55:34,182 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1250802.0, ans=0.125 2023-06-25 03:56:06,679 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1250862.0, ans=0.125 2023-06-25 03:56:15,422 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1250862.0, ans=0.125 2023-06-25 03:56:17,121 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 03:56:17,267 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1250922.0, ans=0.125 2023-06-25 03:56:21,067 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1250922.0, ans=0.0 2023-06-25 03:56:41,066 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=1250982.0, ans=10.0 2023-06-25 03:57:17,496 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1251042.0, ans=0.125 2023-06-25 03:57:27,552 INFO [train.py:996] (0/4) Epoch 7, batch 25550, loss[loss=0.1942, simple_loss=0.2889, pruned_loss=0.04974, over 21548.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.2915, pruned_loss=0.06604, over 4264520.32 frames. ], batch size: 230, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 03:57:41,641 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.140e+02 3.132e+02 4.314e+02 5.832e+02 9.037e+02, threshold=8.627e+02, percent-clipped=4.0 2023-06-25 03:58:03,461 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1251162.0, ans=0.125 2023-06-25 03:59:21,929 INFO [train.py:996] (0/4) Epoch 7, batch 25600, loss[loss=0.2151, simple_loss=0.293, pruned_loss=0.0686, over 21640.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.296, pruned_loss=0.06729, over 4266450.01 frames. 
], batch size: 230, lr: 4.20e-03, grad_scale: 32.0 2023-06-25 03:59:52,432 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1251462.0, ans=0.2 2023-06-25 04:00:34,755 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1251582.0, ans=0.0 2023-06-25 04:00:36,365 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1251582.0, ans=0.0 2023-06-25 04:00:59,717 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=1251642.0, ans=15.0 2023-06-25 04:01:02,377 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1251642.0, ans=0.125 2023-06-25 04:01:09,221 INFO [train.py:996] (0/4) Epoch 7, batch 25650, loss[loss=0.2125, simple_loss=0.2858, pruned_loss=0.06964, over 21761.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2962, pruned_loss=0.06965, over 4265013.58 frames. ], batch size: 124, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 04:01:19,264 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.352e+02 3.050e+02 3.577e+02 4.545e+02 8.924e+02, threshold=7.154e+02, percent-clipped=2.0 2023-06-25 04:02:42,501 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.73 vs. limit=15.0 2023-06-25 04:02:54,043 INFO [train.py:996] (0/4) Epoch 7, batch 25700, loss[loss=0.267, simple_loss=0.3973, pruned_loss=0.06832, over 19832.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2936, pruned_loss=0.07108, over 4263504.29 frames. ], batch size: 702, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 04:03:08,747 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 04:03:30,307 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.92 vs. limit=15.0 2023-06-25 04:03:47,071 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1252122.0, ans=0.04949747468305833 2023-06-25 04:04:06,935 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.11 vs. limit=15.0 2023-06-25 04:04:22,025 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1252182.0, ans=0.0 2023-06-25 04:04:43,956 INFO [train.py:996] (0/4) Epoch 7, batch 25750, loss[loss=0.2114, simple_loss=0.2783, pruned_loss=0.07224, over 19949.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.2991, pruned_loss=0.07415, over 4268424.01 frames. ], batch size: 702, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 04:04:52,285 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1252302.0, ans=0.125 2023-06-25 04:04:55,427 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.453e+02 3.207e+02 3.828e+02 5.534e+02 9.207e+02, threshold=7.655e+02, percent-clipped=4.0 2023-06-25 04:05:28,635 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.29 vs. 
limit=15.0 2023-06-25 04:06:16,015 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.71 vs. limit=15.0 2023-06-25 04:06:41,409 INFO [train.py:996] (0/4) Epoch 7, batch 25800, loss[loss=0.2708, simple_loss=0.3465, pruned_loss=0.09751, over 21378.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3097, pruned_loss=0.07815, over 4268840.92 frames. ], batch size: 159, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 04:07:07,591 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.10 vs. limit=22.5 2023-06-25 04:07:09,081 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1252662.0, ans=0.125 2023-06-25 04:07:57,852 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1252782.0, ans=0.1 2023-06-25 04:08:04,959 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=1252782.0, ans=10.0 2023-06-25 04:08:10,920 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.81 vs. limit=15.0 2023-06-25 04:08:36,054 INFO [train.py:996] (0/4) Epoch 7, batch 25850, loss[loss=0.217, simple_loss=0.2898, pruned_loss=0.07216, over 21491.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3112, pruned_loss=0.07746, over 4275396.78 frames. ], batch size: 548, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 04:08:46,063 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.468e+02 3.799e+02 4.980e+02 7.138e+02 1.041e+03, threshold=9.960e+02, percent-clipped=14.0 2023-06-25 04:08:52,325 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 04:09:12,788 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1252962.0, ans=0.125 2023-06-25 04:09:32,128 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1253022.0, ans=0.125 2023-06-25 04:10:20,239 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1253142.0, ans=0.0 2023-06-25 04:10:22,284 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.29 vs. limit=6.0 2023-06-25 04:10:24,679 INFO [train.py:996] (0/4) Epoch 7, batch 25900, loss[loss=0.2576, simple_loss=0.3493, pruned_loss=0.08294, over 21612.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3131, pruned_loss=0.07857, over 4278929.76 frames. 
], batch size: 263, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 04:11:06,772 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1253262.0, ans=0.035 2023-06-25 04:11:10,618 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1253322.0, ans=0.1 2023-06-25 04:11:14,616 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1253322.0, ans=0.125 2023-06-25 04:11:18,159 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1253322.0, ans=0.125 2023-06-25 04:11:44,738 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1253382.0, ans=0.125 2023-06-25 04:11:53,913 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1253442.0, ans=0.125 2023-06-25 04:12:11,671 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1253442.0, ans=0.0 2023-06-25 04:12:19,423 INFO [train.py:996] (0/4) Epoch 7, batch 25950, loss[loss=0.2471, simple_loss=0.3276, pruned_loss=0.08327, over 21850.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3192, pruned_loss=0.08134, over 4283181.78 frames. ], batch size: 118, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 04:12:30,260 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.260e+02 3.924e+02 4.825e+02 6.667e+02 9.345e+02, threshold=9.651e+02, percent-clipped=0.0 2023-06-25 04:12:41,791 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=6.61 vs. limit=22.5 2023-06-25 04:12:47,788 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1253562.0, ans=0.125 2023-06-25 04:12:50,213 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.52 vs. limit=22.5 2023-06-25 04:13:19,173 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1253682.0, ans=0.1 2023-06-25 04:13:34,020 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1253682.0, ans=0.1 2023-06-25 04:14:08,597 INFO [train.py:996] (0/4) Epoch 7, batch 26000, loss[loss=0.3157, simple_loss=0.3807, pruned_loss=0.1253, over 21356.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3188, pruned_loss=0.0799, over 4283835.71 frames. ], batch size: 507, lr: 4.20e-03, grad_scale: 32.0 2023-06-25 04:14:42,038 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1253862.0, ans=0.2 2023-06-25 04:15:36,735 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.09 vs. 
limit=22.5 2023-06-25 04:15:50,291 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1254042.0, ans=0.1 2023-06-25 04:15:57,124 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1254102.0, ans=0.0 2023-06-25 04:15:58,149 INFO [train.py:996] (0/4) Epoch 7, batch 26050, loss[loss=0.2853, simple_loss=0.329, pruned_loss=0.1208, over 21675.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3196, pruned_loss=0.0814, over 4283887.59 frames. ], batch size: 507, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 04:16:10,016 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.358e+02 3.188e+02 3.821e+02 5.430e+02 8.574e+02, threshold=7.643e+02, percent-clipped=0.0 2023-06-25 04:16:25,432 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1254162.0, ans=0.0 2023-06-25 04:16:56,760 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1254222.0, ans=0.1 2023-06-25 04:17:23,847 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1254342.0, ans=0.125 2023-06-25 04:17:45,906 INFO [train.py:996] (0/4) Epoch 7, batch 26100, loss[loss=0.2435, simple_loss=0.3082, pruned_loss=0.08943, over 21376.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3142, pruned_loss=0.08053, over 4282053.27 frames. ], batch size: 143, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 04:17:57,301 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.33 vs. limit=15.0 2023-06-25 04:19:35,052 INFO [train.py:996] (0/4) Epoch 7, batch 26150, loss[loss=0.2594, simple_loss=0.3236, pruned_loss=0.09761, over 21836.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3114, pruned_loss=0.08051, over 4290833.74 frames. ], batch size: 441, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 04:19:47,504 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.460e+02 3.240e+02 3.858e+02 5.306e+02 8.605e+02, threshold=7.716e+02, percent-clipped=2.0 2023-06-25 04:20:02,179 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1254762.0, ans=0.1 2023-06-25 04:20:45,845 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1254882.0, ans=0.2 2023-06-25 04:20:49,794 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.27 vs. limit=15.0 2023-06-25 04:21:06,902 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1254942.0, ans=0.125 2023-06-25 04:21:24,108 INFO [train.py:996] (0/4) Epoch 7, batch 26200, loss[loss=0.2103, simple_loss=0.3091, pruned_loss=0.0558, over 21495.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3117, pruned_loss=0.07875, over 4285215.71 frames. 
], batch size: 211, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 04:21:35,542 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1255002.0, ans=0.2 2023-06-25 04:22:10,642 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=1255122.0, ans=10.0 2023-06-25 04:23:13,413 INFO [train.py:996] (0/4) Epoch 7, batch 26250, loss[loss=0.2169, simple_loss=0.2918, pruned_loss=0.07103, over 21356.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3159, pruned_loss=0.07814, over 4284866.23 frames. ], batch size: 176, lr: 4.19e-03, grad_scale: 16.0 2023-06-25 04:23:19,090 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1255302.0, ans=0.125 2023-06-25 04:23:25,312 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.169e+02 3.172e+02 3.762e+02 4.925e+02 1.309e+03, threshold=7.524e+02, percent-clipped=5.0 2023-06-25 04:23:37,781 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1255362.0, ans=0.0 2023-06-25 04:24:07,231 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1255422.0, ans=0.1 2023-06-25 04:24:59,662 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1255602.0, ans=0.125 2023-06-25 04:25:01,097 INFO [train.py:996] (0/4) Epoch 7, batch 26300, loss[loss=0.2293, simple_loss=0.302, pruned_loss=0.07835, over 21869.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3119, pruned_loss=0.07858, over 4295208.72 frames. ], batch size: 107, lr: 4.19e-03, grad_scale: 16.0 2023-06-25 04:25:08,701 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1255602.0, ans=0.1 2023-06-25 04:25:44,379 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1255662.0, ans=0.1 2023-06-25 04:25:47,811 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1255722.0, ans=0.0 2023-06-25 04:26:34,597 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1255842.0, ans=0.125 2023-06-25 04:26:45,065 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1255842.0, ans=0.125 2023-06-25 04:26:53,863 INFO [train.py:996] (0/4) Epoch 7, batch 26350, loss[loss=0.2519, simple_loss=0.3246, pruned_loss=0.08959, over 21389.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3098, pruned_loss=0.07864, over 4296312.09 frames. 
], batch size: 159, lr: 4.19e-03, grad_scale: 16.0 2023-06-25 04:26:56,309 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1255902.0, ans=0.1 2023-06-25 04:26:57,954 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1255902.0, ans=0.125 2023-06-25 04:27:11,529 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.447e+02 3.110e+02 3.681e+02 4.505e+02 7.991e+02, threshold=7.361e+02, percent-clipped=2.0 2023-06-25 04:27:27,507 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1255962.0, ans=0.05 2023-06-25 04:27:28,085 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.90 vs. limit=15.0 2023-06-25 04:27:59,521 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1256082.0, ans=0.125 2023-06-25 04:28:40,463 INFO [train.py:996] (0/4) Epoch 7, batch 26400, loss[loss=0.2451, simple_loss=0.2846, pruned_loss=0.1028, over 21336.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3048, pruned_loss=0.07872, over 4294622.37 frames. ], batch size: 507, lr: 4.19e-03, grad_scale: 32.0 2023-06-25 04:28:41,168 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1256202.0, ans=0.2 2023-06-25 04:28:46,740 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 04:29:10,982 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1256262.0, ans=0.1 2023-06-25 04:29:41,953 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=1256322.0, ans=0.5 2023-06-25 04:29:44,479 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1256322.0, ans=0.125 2023-06-25 04:30:03,231 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.30 vs. limit=15.0 2023-06-25 04:30:39,821 INFO [train.py:996] (0/4) Epoch 7, batch 26450, loss[loss=0.2296, simple_loss=0.32, pruned_loss=0.06965, over 21426.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3055, pruned_loss=0.0786, over 4289434.39 frames. ], batch size: 211, lr: 4.19e-03, grad_scale: 32.0 2023-06-25 04:30:48,493 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.44 vs. 
limit=22.5 2023-06-25 04:30:57,251 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.311e+02 3.534e+02 4.471e+02 5.534e+02 1.801e+03, threshold=8.941e+02, percent-clipped=10.0 2023-06-25 04:31:35,247 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1256622.0, ans=0.125 2023-06-25 04:32:09,623 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1256742.0, ans=0.2 2023-06-25 04:32:36,252 INFO [train.py:996] (0/4) Epoch 7, batch 26500, loss[loss=0.2463, simple_loss=0.3312, pruned_loss=0.08069, over 21678.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3078, pruned_loss=0.07781, over 4282006.05 frames. ], batch size: 414, lr: 4.19e-03, grad_scale: 32.0 2023-06-25 04:32:44,554 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.07 vs. limit=10.0 2023-06-25 04:33:02,415 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.19 vs. limit=15.0 2023-06-25 04:34:32,717 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.89 vs. limit=15.0 2023-06-25 04:34:33,101 INFO [train.py:996] (0/4) Epoch 7, batch 26550, loss[loss=0.2069, simple_loss=0.3089, pruned_loss=0.0525, over 21142.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3058, pruned_loss=0.07533, over 4275470.35 frames. ], batch size: 548, lr: 4.19e-03, grad_scale: 16.0 2023-06-25 04:34:47,495 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.196e+02 3.332e+02 4.391e+02 7.235e+02 1.419e+03, threshold=8.782e+02, percent-clipped=20.0 2023-06-25 04:35:09,213 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 04:36:21,172 INFO [train.py:996] (0/4) Epoch 7, batch 26600, loss[loss=0.215, simple_loss=0.2815, pruned_loss=0.07419, over 21846.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.3042, pruned_loss=0.07256, over 4276281.26 frames. ], batch size: 107, lr: 4.19e-03, grad_scale: 16.0 2023-06-25 04:36:41,298 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1257402.0, ans=0.125 2023-06-25 04:36:57,692 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1257462.0, ans=0.1 2023-06-25 04:37:08,999 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.98 vs. limit=22.5 2023-06-25 04:37:29,152 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.90 vs. limit=12.0 2023-06-25 04:38:10,092 INFO [train.py:996] (0/4) Epoch 7, batch 26650, loss[loss=0.1609, simple_loss=0.2485, pruned_loss=0.03662, over 21709.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2966, pruned_loss=0.07081, over 4262789.47 frames. 
], batch size: 298, lr: 4.19e-03, grad_scale: 16.0 2023-06-25 04:38:28,649 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.814e+02 2.895e+02 3.400e+02 5.153e+02 1.068e+03, threshold=6.799e+02, percent-clipped=4.0 2023-06-25 04:38:44,905 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1257762.0, ans=0.125 2023-06-25 04:38:46,967 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1257762.0, ans=0.0 2023-06-25 04:39:08,685 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1257822.0, ans=0.0 2023-06-25 04:39:21,554 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.64 vs. limit=22.5 2023-06-25 04:39:29,464 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1257882.0, ans=0.2 2023-06-25 04:39:57,580 INFO [train.py:996] (0/4) Epoch 7, batch 26700, loss[loss=0.1966, simple_loss=0.2724, pruned_loss=0.06037, over 21842.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2897, pruned_loss=0.0679, over 4272820.33 frames. ], batch size: 282, lr: 4.19e-03, grad_scale: 16.0 2023-06-25 04:40:37,247 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1258122.0, ans=0.1 2023-06-25 04:41:12,711 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1258182.0, ans=0.04949747468305833 2023-06-25 04:41:52,591 INFO [train.py:996] (0/4) Epoch 7, batch 26750, loss[loss=0.2012, simple_loss=0.2803, pruned_loss=0.06107, over 21637.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2895, pruned_loss=0.06666, over 4280859.63 frames. ], batch size: 230, lr: 4.19e-03, grad_scale: 16.0 2023-06-25 04:42:06,352 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.047e+02 2.716e+02 3.514e+02 4.569e+02 1.217e+03, threshold=7.028e+02, percent-clipped=8.0 2023-06-25 04:42:07,220 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1258302.0, ans=0.1 2023-06-25 04:42:20,537 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1258362.0, ans=0.1 2023-06-25 04:43:29,211 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1258542.0, ans=0.1 2023-06-25 04:43:31,596 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 04:43:36,855 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1258542.0, ans=0.035 2023-06-25 04:43:43,454 INFO [train.py:996] (0/4) Epoch 7, batch 26800, loss[loss=0.2966, simple_loss=0.3519, pruned_loss=0.1207, over 21461.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2977, pruned_loss=0.07156, over 4282112.35 frames. 
], batch size: 510, lr: 4.19e-03, grad_scale: 32.0 2023-06-25 04:43:47,894 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1258602.0, ans=0.2 2023-06-25 04:44:31,397 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.46 vs. limit=15.0 2023-06-25 04:44:56,966 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.75 vs. limit=10.0 2023-06-25 04:45:03,828 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1258782.0, ans=0.2 2023-06-25 04:45:31,690 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 04:45:32,700 INFO [train.py:996] (0/4) Epoch 7, batch 26850, loss[loss=0.1964, simple_loss=0.2508, pruned_loss=0.071, over 20073.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.2988, pruned_loss=0.07354, over 4276510.25 frames. ], batch size: 703, lr: 4.19e-03, grad_scale: 32.0 2023-06-25 04:45:58,737 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.727e+02 3.580e+02 4.511e+02 5.580e+02 1.314e+03, threshold=9.022e+02, percent-clipped=13.0 2023-06-25 04:46:12,734 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1258962.0, ans=0.015 2023-06-25 04:47:22,412 INFO [train.py:996] (0/4) Epoch 7, batch 26900, loss[loss=0.1739, simple_loss=0.2381, pruned_loss=0.05478, over 21600.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2906, pruned_loss=0.07259, over 4271612.87 frames. ], batch size: 298, lr: 4.19e-03, grad_scale: 16.0 2023-06-25 04:48:11,230 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.28 vs. limit=22.5 2023-06-25 04:48:31,907 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1259382.0, ans=0.0 2023-06-25 04:49:06,712 INFO [train.py:996] (0/4) Epoch 7, batch 26950, loss[loss=0.2356, simple_loss=0.3111, pruned_loss=0.07999, over 21495.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2896, pruned_loss=0.07266, over 4250666.99 frames. ], batch size: 389, lr: 4.19e-03, grad_scale: 16.0 2023-06-25 04:49:13,644 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.24 vs. limit=6.0 2023-06-25 04:49:33,612 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.559e+02 3.020e+02 3.484e+02 4.294e+02 8.554e+02, threshold=6.967e+02, percent-clipped=0.0 2023-06-25 04:49:44,658 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1259562.0, ans=0.2 2023-06-25 04:50:25,278 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1259682.0, ans=0.0 2023-06-25 04:50:56,123 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1259802.0, ans=0.05 2023-06-25 04:51:02,039 INFO [train.py:996] (0/4) Epoch 7, batch 27000, loss[loss=0.1731, simple_loss=0.2538, pruned_loss=0.0462, over 21374.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2907, pruned_loss=0.07105, over 4246998.06 frames. 
], batch size: 211, lr: 4.19e-03, grad_scale: 16.0 2023-06-25 04:51:02,041 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-25 04:51:24,273 INFO [train.py:1028] (0/4) Epoch 7, validation: loss=0.2512, simple_loss=0.3463, pruned_loss=0.07806, over 1796401.00 frames. 2023-06-25 04:51:24,274 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23714MB 2023-06-25 04:51:29,372 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1259802.0, ans=0.125 2023-06-25 04:52:44,810 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1260042.0, ans=0.2 2023-06-25 04:53:00,236 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1260042.0, ans=0.1 2023-06-25 04:53:14,923 INFO [train.py:996] (0/4) Epoch 7, batch 27050, loss[loss=0.2203, simple_loss=0.2905, pruned_loss=0.07502, over 21222.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2931, pruned_loss=0.06849, over 4247081.92 frames. ], batch size: 143, lr: 4.19e-03, grad_scale: 16.0 2023-06-25 04:53:34,730 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.941e+02 2.897e+02 3.762e+02 4.771e+02 8.226e+02, threshold=7.524e+02, percent-clipped=2.0 2023-06-25 04:54:00,511 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.75 vs. limit=15.0 2023-06-25 04:54:11,076 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1260222.0, ans=0.125 2023-06-25 04:54:29,280 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.12 vs. limit=15.0 2023-06-25 04:55:02,885 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1260402.0, ans=0.2 2023-06-25 04:55:02,905 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1260402.0, ans=0.125 2023-06-25 04:55:04,185 INFO [train.py:996] (0/4) Epoch 7, batch 27100, loss[loss=0.222, simple_loss=0.2901, pruned_loss=0.07698, over 21614.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.295, pruned_loss=0.06932, over 4250122.55 frames. ], batch size: 263, lr: 4.19e-03, grad_scale: 16.0 2023-06-25 04:55:06,304 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1260402.0, ans=0.0 2023-06-25 04:55:19,741 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1260402.0, ans=0.04949747468305833 2023-06-25 04:56:08,635 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1260582.0, ans=0.1 2023-06-25 04:56:24,117 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1260582.0, ans=0.125 2023-06-25 04:56:37,378 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1260642.0, ans=0.0 2023-06-25 04:56:53,980 INFO [train.py:996] (0/4) Epoch 7, batch 27150, loss[loss=0.2399, simple_loss=0.3423, pruned_loss=0.0688, over 21751.00 frames. 
], tot_loss[loss=0.2246, simple_loss=0.3054, pruned_loss=0.07189, over 4264213.63 frames. ], batch size: 351, lr: 4.19e-03, grad_scale: 16.0 2023-06-25 04:57:19,855 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.408e+02 3.400e+02 4.098e+02 5.830e+02 1.178e+03, threshold=8.196e+02, percent-clipped=9.0 2023-06-25 04:57:22,096 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1260762.0, ans=0.125 2023-06-25 04:57:30,409 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.31 vs. limit=15.0 2023-06-25 04:57:36,790 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.77 vs. limit=12.0 2023-06-25 04:57:39,994 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1260822.0, ans=0.1 2023-06-25 04:58:16,818 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1260882.0, ans=0.04949747468305833 2023-06-25 04:58:42,867 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff3.min_abs, batch_count=1260942.0, ans=0.2 2023-06-25 04:58:53,838 INFO [train.py:996] (0/4) Epoch 7, batch 27200, loss[loss=0.2455, simple_loss=0.3241, pruned_loss=0.08344, over 21611.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3125, pruned_loss=0.0742, over 4276488.95 frames. ], batch size: 263, lr: 4.19e-03, grad_scale: 32.0 2023-06-25 04:59:02,349 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.23 vs. limit=15.0 2023-06-25 04:59:47,072 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.65 vs. limit=10.0 2023-06-25 04:59:47,878 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1261122.0, ans=0.0 2023-06-25 05:00:44,398 INFO [train.py:996] (0/4) Epoch 7, batch 27250, loss[loss=0.2546, simple_loss=0.3178, pruned_loss=0.09564, over 21800.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3152, pruned_loss=0.07803, over 4278477.79 frames. 
], batch size: 247, lr: 4.18e-03, grad_scale: 16.0 2023-06-25 05:01:02,617 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.541e+02 3.239e+02 3.756e+02 4.583e+02 7.251e+02, threshold=7.513e+02, percent-clipped=0.0 2023-06-25 05:01:21,008 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1261362.0, ans=0.125 2023-06-25 05:01:37,754 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1261422.0, ans=0.0 2023-06-25 05:01:48,514 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1261422.0, ans=0.2 2023-06-25 05:01:48,553 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1261422.0, ans=0.025 2023-06-25 05:01:50,079 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1261422.0, ans=0.125 2023-06-25 05:02:06,349 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1261482.0, ans=0.125 2023-06-25 05:02:22,667 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1261542.0, ans=0.035 2023-06-25 05:02:36,162 INFO [train.py:996] (0/4) Epoch 7, batch 27300, loss[loss=0.239, simple_loss=0.3178, pruned_loss=0.08013, over 21474.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3173, pruned_loss=0.07944, over 4277554.43 frames. ], batch size: 211, lr: 4.18e-03, grad_scale: 16.0 2023-06-25 05:03:23,662 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1261662.0, ans=0.07 2023-06-25 05:04:02,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1261782.0, ans=0.04949747468305833 2023-06-25 05:04:26,393 INFO [train.py:996] (0/4) Epoch 7, batch 27350, loss[loss=0.2422, simple_loss=0.3322, pruned_loss=0.07613, over 21629.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3201, pruned_loss=0.08051, over 4277212.22 frames. ], batch size: 414, lr: 4.18e-03, grad_scale: 16.0 2023-06-25 05:04:47,363 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1261962.0, ans=0.125 2023-06-25 05:04:48,250 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.380e+02 3.470e+02 4.790e+02 5.992e+02 9.415e+02, threshold=9.580e+02, percent-clipped=9.0 2023-06-25 05:05:21,259 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 05:06:18,602 INFO [train.py:996] (0/4) Epoch 7, batch 27400, loss[loss=0.2132, simple_loss=0.28, pruned_loss=0.07322, over 21768.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3151, pruned_loss=0.08002, over 4281842.50 frames. 
], batch size: 371, lr: 4.18e-03, grad_scale: 16.0 2023-06-25 05:06:52,013 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1262262.0, ans=0.125 2023-06-25 05:06:52,113 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1262262.0, ans=0.125 2023-06-25 05:07:28,602 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1262382.0, ans=0.125 2023-06-25 05:08:08,413 INFO [train.py:996] (0/4) Epoch 7, batch 27450, loss[loss=0.2224, simple_loss=0.316, pruned_loss=0.06442, over 21405.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3085, pruned_loss=0.07784, over 4283772.24 frames. ], batch size: 194, lr: 4.18e-03, grad_scale: 16.0 2023-06-25 05:08:36,545 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.458e+02 3.140e+02 3.820e+02 5.353e+02 9.307e+02, threshold=7.640e+02, percent-clipped=0.0 2023-06-25 05:08:52,290 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1262622.0, ans=0.125 2023-06-25 05:09:07,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1262622.0, ans=0.1 2023-06-25 05:09:33,501 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1262742.0, ans=0.04949747468305833 2023-06-25 05:09:50,498 INFO [train.py:996] (0/4) Epoch 7, batch 27500, loss[loss=0.1964, simple_loss=0.261, pruned_loss=0.0659, over 21183.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3062, pruned_loss=0.07728, over 4289033.73 frames. ], batch size: 608, lr: 4.18e-03, grad_scale: 16.0 2023-06-25 05:11:43,249 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.38 vs. limit=6.0 2023-06-25 05:11:43,529 INFO [train.py:996] (0/4) Epoch 7, batch 27550, loss[loss=0.211, simple_loss=0.2818, pruned_loss=0.07015, over 21737.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.301, pruned_loss=0.0747, over 4287214.81 frames. ], batch size: 351, lr: 4.18e-03, grad_scale: 16.0 2023-06-25 05:11:45,641 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1263102.0, ans=0.125 2023-06-25 05:12:10,958 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.562e+02 3.311e+02 4.001e+02 4.826e+02 1.149e+03, threshold=8.002e+02, percent-clipped=4.0 2023-06-25 05:12:22,496 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.43 vs. limit=12.0 2023-06-25 05:13:29,712 INFO [train.py:996] (0/4) Epoch 7, batch 27600, loss[loss=0.1945, simple_loss=0.2595, pruned_loss=0.06468, over 21549.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.2963, pruned_loss=0.07427, over 4274519.64 frames. ], batch size: 263, lr: 4.18e-03, grad_scale: 32.0 2023-06-25 05:13:47,482 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1263402.0, ans=0.5 2023-06-25 05:14:03,458 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.29 vs. 
limit=15.0 2023-06-25 05:14:42,037 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1263582.0, ans=0.0 2023-06-25 05:15:10,404 INFO [train.py:996] (0/4) Epoch 7, batch 27650, loss[loss=0.2064, simple_loss=0.2662, pruned_loss=0.07336, over 21455.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2901, pruned_loss=0.0731, over 4277609.13 frames. ], batch size: 194, lr: 4.18e-03, grad_scale: 32.0 2023-06-25 05:15:37,192 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.286e+02 3.109e+02 3.684e+02 5.059e+02 1.214e+03, threshold=7.368e+02, percent-clipped=6.0 2023-06-25 05:15:53,607 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1263762.0, ans=0.0 2023-06-25 05:16:19,742 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1263822.0, ans=0.125 2023-06-25 05:16:27,122 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1263882.0, ans=0.0 2023-06-25 05:16:35,999 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.34 vs. limit=15.0 2023-06-25 05:16:37,166 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1263882.0, ans=0.04949747468305833 2023-06-25 05:16:53,215 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1263942.0, ans=0.1 2023-06-25 05:16:57,731 INFO [train.py:996] (0/4) Epoch 7, batch 27700, loss[loss=0.2225, simple_loss=0.3072, pruned_loss=0.06891, over 21789.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.291, pruned_loss=0.07136, over 4277336.04 frames. ], batch size: 332, lr: 4.18e-03, grad_scale: 16.0 2023-06-25 05:17:18,913 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1264002.0, ans=0.125 2023-06-25 05:17:33,431 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.02 vs. limit=15.0 2023-06-25 05:18:27,117 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.45 vs. limit=12.0 2023-06-25 05:18:49,457 INFO [train.py:996] (0/4) Epoch 7, batch 27750, loss[loss=0.205, simple_loss=0.2908, pruned_loss=0.05958, over 21807.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2945, pruned_loss=0.07115, over 4285682.51 frames. ], batch size: 332, lr: 4.18e-03, grad_scale: 16.0 2023-06-25 05:19:19,115 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.393e+02 2.962e+02 3.488e+02 4.454e+02 9.416e+02, threshold=6.976e+02, percent-clipped=4.0 2023-06-25 05:19:30,017 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1264362.0, ans=0.125 2023-06-25 05:19:43,545 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1264422.0, ans=0.125 2023-06-25 05:20:36,091 INFO [train.py:996] (0/4) Epoch 7, batch 27800, loss[loss=0.2214, simple_loss=0.2876, pruned_loss=0.07757, over 21632.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2923, pruned_loss=0.07149, over 4280208.37 frames. 
], batch size: 195, lr: 4.18e-03, grad_scale: 16.0 2023-06-25 05:20:38,906 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1264602.0, ans=0.1 2023-06-25 05:21:09,709 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1264662.0, ans=0.125 2023-06-25 05:21:37,182 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1264722.0, ans=0.1 2023-06-25 05:21:52,015 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.23 vs. limit=15.0 2023-06-25 05:21:52,900 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1264782.0, ans=0.0 2023-06-25 05:22:07,618 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1264842.0, ans=0.0 2023-06-25 05:22:24,421 INFO [train.py:996] (0/4) Epoch 7, batch 27850, loss[loss=0.1957, simple_loss=0.2597, pruned_loss=0.06589, over 21204.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.291, pruned_loss=0.07234, over 4283182.75 frames. ], batch size: 608, lr: 4.18e-03, grad_scale: 8.0 2023-06-25 05:22:57,342 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.414e+02 3.134e+02 3.811e+02 5.096e+02 8.843e+02, threshold=7.621e+02, percent-clipped=7.0 2023-06-25 05:23:57,389 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1265142.0, ans=0.0 2023-06-25 05:24:11,164 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1265142.0, ans=0.1 2023-06-25 05:24:27,039 INFO [train.py:996] (0/4) Epoch 7, batch 27900, loss[loss=0.2553, simple_loss=0.3475, pruned_loss=0.08152, over 21620.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.299, pruned_loss=0.073, over 4281845.80 frames. ], batch size: 441, lr: 4.18e-03, grad_scale: 8.0 2023-06-25 05:25:11,065 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1265322.0, ans=0.2 2023-06-25 05:26:21,596 INFO [train.py:996] (0/4) Epoch 7, batch 27950, loss[loss=0.2095, simple_loss=0.3023, pruned_loss=0.05833, over 21717.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.3, pruned_loss=0.07021, over 4284185.57 frames. ], batch size: 332, lr: 4.18e-03, grad_scale: 8.0 2023-06-25 05:26:42,606 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.145e+02 3.117e+02 4.053e+02 5.979e+02 1.114e+03, threshold=8.107e+02, percent-clipped=11.0 2023-06-25 05:27:49,302 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1265742.0, ans=0.125 2023-06-25 05:28:09,601 INFO [train.py:996] (0/4) Epoch 7, batch 28000, loss[loss=0.2514, simple_loss=0.3223, pruned_loss=0.09027, over 21756.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2985, pruned_loss=0.06815, over 4282719.92 frames. 
], batch size: 112, lr: 4.18e-03, grad_scale: 16.0 2023-06-25 05:28:17,664 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1265802.0, ans=0.125 2023-06-25 05:28:24,717 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1265802.0, ans=0.125 2023-06-25 05:28:53,455 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1265922.0, ans=0.0 2023-06-25 05:29:02,498 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff3.min_abs, batch_count=1265922.0, ans=0.2 2023-06-25 05:29:09,346 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 05:30:01,607 INFO [train.py:996] (0/4) Epoch 7, batch 28050, loss[loss=0.2209, simple_loss=0.3196, pruned_loss=0.06114, over 20848.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2968, pruned_loss=0.06981, over 4281770.66 frames. ], batch size: 608, lr: 4.18e-03, grad_scale: 16.0 2023-06-25 05:30:22,381 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.141e+02 2.952e+02 3.818e+02 5.160e+02 1.220e+03, threshold=7.636e+02, percent-clipped=4.0 2023-06-25 05:30:45,408 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1266222.0, ans=0.2 2023-06-25 05:31:51,601 INFO [train.py:996] (0/4) Epoch 7, batch 28100, loss[loss=0.2278, simple_loss=0.2895, pruned_loss=0.08307, over 21864.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2944, pruned_loss=0.06999, over 4274556.81 frames. ], batch size: 98, lr: 4.18e-03, grad_scale: 16.0 2023-06-25 05:32:11,475 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1266462.0, ans=0.125 2023-06-25 05:32:16,920 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1266462.0, ans=0.1 2023-06-25 05:32:44,394 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1266522.0, ans=0.2 2023-06-25 05:33:11,540 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1266582.0, ans=0.0 2023-06-25 05:33:13,774 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1266582.0, ans=0.125 2023-06-25 05:33:40,487 INFO [train.py:996] (0/4) Epoch 7, batch 28150, loss[loss=0.1946, simple_loss=0.2579, pruned_loss=0.06561, over 21654.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.287, pruned_loss=0.06962, over 4270517.21 frames. ], batch size: 298, lr: 4.18e-03, grad_scale: 16.0 2023-06-25 05:33:55,022 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1266702.0, ans=0.0 2023-06-25 05:34:01,757 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.438e+02 3.369e+02 4.176e+02 5.786e+02 1.041e+03, threshold=8.353e+02, percent-clipped=8.0 2023-06-25 05:34:06,042 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1266762.0, ans=0.125 2023-06-25 05:34:42,169 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.24 vs. 
limit=22.5 2023-06-25 05:34:44,019 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.40 vs. limit=22.5 2023-06-25 05:35:29,309 INFO [train.py:996] (0/4) Epoch 7, batch 28200, loss[loss=0.2427, simple_loss=0.3137, pruned_loss=0.08582, over 21201.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2872, pruned_loss=0.07124, over 4269500.90 frames. ], batch size: 143, lr: 4.18e-03, grad_scale: 16.0 2023-06-25 05:36:30,388 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1267122.0, ans=0.0 2023-06-25 05:36:31,124 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.71 vs. limit=15.0 2023-06-25 05:36:51,069 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1267182.0, ans=0.2 2023-06-25 05:37:00,991 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1267242.0, ans=0.125 2023-06-25 05:37:13,161 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1267242.0, ans=0.07 2023-06-25 05:37:17,993 INFO [train.py:996] (0/4) Epoch 7, batch 28250, loss[loss=0.2566, simple_loss=0.2985, pruned_loss=0.1074, over 21445.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2925, pruned_loss=0.07376, over 4269425.15 frames. ], batch size: 475, lr: 4.17e-03, grad_scale: 16.0 2023-06-25 05:37:23,725 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1267302.0, ans=0.1 2023-06-25 05:37:43,658 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.555e+02 3.449e+02 4.309e+02 5.866e+02 1.082e+03, threshold=8.618e+02, percent-clipped=6.0 2023-06-25 05:38:05,945 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1267362.0, ans=0.125 2023-06-25 05:38:07,502 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1267422.0, ans=0.0 2023-06-25 05:38:13,378 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.57 vs. limit=8.0 2023-06-25 05:38:41,545 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1267482.0, ans=0.0 2023-06-25 05:39:08,750 INFO [train.py:996] (0/4) Epoch 7, batch 28300, loss[loss=0.1901, simple_loss=0.2836, pruned_loss=0.04832, over 21707.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2895, pruned_loss=0.07184, over 4254600.43 frames. 
], batch size: 298, lr: 4.17e-03, grad_scale: 16.0 2023-06-25 05:40:00,747 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1267722.0, ans=0.125 2023-06-25 05:40:11,826 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1267722.0, ans=0.125 2023-06-25 05:40:13,715 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1267722.0, ans=0.125 2023-06-25 05:40:57,529 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1267842.0, ans=0.0 2023-06-25 05:41:00,921 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1267842.0, ans=0.125 2023-06-25 05:41:03,578 INFO [train.py:996] (0/4) Epoch 7, batch 28350, loss[loss=0.1894, simple_loss=0.3204, pruned_loss=0.02919, over 20790.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2864, pruned_loss=0.06677, over 4253618.45 frames. ], batch size: 607, lr: 4.17e-03, grad_scale: 16.0 2023-06-25 05:41:25,121 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1267962.0, ans=0.0 2023-06-25 05:41:29,701 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 1.944e+02 2.753e+02 3.449e+02 4.988e+02 1.144e+03, threshold=6.899e+02, percent-clipped=4.0 2023-06-25 05:42:02,323 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.05 vs. limit=22.5 2023-06-25 05:42:29,576 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1268142.0, ans=0.125 2023-06-25 05:42:29,747 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1268142.0, ans=0.1 2023-06-25 05:42:51,317 INFO [train.py:996] (0/4) Epoch 7, batch 28400, loss[loss=0.2391, simple_loss=0.3089, pruned_loss=0.08466, over 21329.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2817, pruned_loss=0.06632, over 4232540.44 frames. ], batch size: 549, lr: 4.17e-03, grad_scale: 32.0 2023-06-25 05:43:37,193 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1268322.0, ans=0.0 2023-06-25 05:43:37,691 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.53 vs. limit=10.0 2023-06-25 05:43:56,206 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1268382.0, ans=0.0 2023-06-25 05:43:56,309 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1268382.0, ans=0.125 2023-06-25 05:44:41,994 INFO [train.py:996] (0/4) Epoch 7, batch 28450, loss[loss=0.2208, simple_loss=0.2938, pruned_loss=0.07396, over 21949.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2848, pruned_loss=0.06876, over 4239614.44 frames. 
], batch size: 316, lr: 4.17e-03, grad_scale: 16.0 2023-06-25 05:44:51,393 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1268502.0, ans=0.1 2023-06-25 05:45:05,761 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1268562.0, ans=0.125 2023-06-25 05:45:15,053 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.500e+02 3.249e+02 3.944e+02 5.811e+02 1.668e+03, threshold=7.889e+02, percent-clipped=19.0 2023-06-25 05:45:41,990 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1268622.0, ans=0.1 2023-06-25 05:46:36,232 INFO [train.py:996] (0/4) Epoch 7, batch 28500, loss[loss=0.2378, simple_loss=0.3098, pruned_loss=0.08291, over 21768.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2889, pruned_loss=0.07215, over 4257747.24 frames. ], batch size: 298, lr: 4.17e-03, grad_scale: 16.0 2023-06-25 05:47:00,084 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.37 vs. limit=15.0 2023-06-25 05:47:10,316 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1268862.0, ans=0.0 2023-06-25 05:47:15,673 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1268922.0, ans=0.125 2023-06-25 05:48:17,930 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1269042.0, ans=0.1 2023-06-25 05:48:20,219 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.38 vs. limit=22.5 2023-06-25 05:48:31,035 INFO [train.py:996] (0/4) Epoch 7, batch 28550, loss[loss=0.2758, simple_loss=0.3578, pruned_loss=0.09686, over 21750.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.2977, pruned_loss=0.07452, over 4263323.71 frames. ], batch size: 441, lr: 4.17e-03, grad_scale: 16.0 2023-06-25 05:48:40,167 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1269102.0, ans=0.2 2023-06-25 05:48:53,558 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.539e+02 3.516e+02 4.419e+02 5.883e+02 1.246e+03, threshold=8.838e+02, percent-clipped=8.0 2023-06-25 05:48:56,151 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1269162.0, ans=0.0 2023-06-25 05:48:57,637 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1269162.0, ans=0.125 2023-06-25 05:49:40,753 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1269282.0, ans=0.2 2023-06-25 05:50:18,734 INFO [train.py:996] (0/4) Epoch 7, batch 28600, loss[loss=0.2299, simple_loss=0.3072, pruned_loss=0.07629, over 21747.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3055, pruned_loss=0.07794, over 4268809.59 frames. 
], batch size: 124, lr: 4.17e-03, grad_scale: 16.0 2023-06-25 05:50:40,971 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1269462.0, ans=0.0 2023-06-25 05:50:41,057 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_na.min_abs, batch_count=1269462.0, ans=0.02 2023-06-25 05:50:46,087 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1269462.0, ans=0.0 2023-06-25 05:50:51,339 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1269462.0, ans=0.0 2023-06-25 05:50:51,354 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1269462.0, ans=0.2 2023-06-25 05:52:07,670 INFO [train.py:996] (0/4) Epoch 7, batch 28650, loss[loss=0.1932, simple_loss=0.2604, pruned_loss=0.06303, over 21534.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.3006, pruned_loss=0.07717, over 4269371.67 frames. ], batch size: 263, lr: 4.17e-03, grad_scale: 16.0 2023-06-25 05:52:08,177 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1269702.0, ans=0.0 2023-06-25 05:52:11,727 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1269702.0, ans=0.125 2023-06-25 05:52:23,897 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1269762.0, ans=0.0 2023-06-25 05:52:30,246 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.450e+02 3.536e+02 4.575e+02 6.589e+02 8.896e+02, threshold=9.150e+02, percent-clipped=1.0 2023-06-25 05:52:40,964 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.14 vs. limit=15.0 2023-06-25 05:53:28,146 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1269882.0, ans=0.0 2023-06-25 05:53:40,440 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1269942.0, ans=0.125 2023-06-25 05:53:49,289 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.99 vs. limit=15.0 2023-06-25 05:53:55,687 INFO [train.py:996] (0/4) Epoch 7, batch 28700, loss[loss=0.189, simple_loss=0.2453, pruned_loss=0.06638, over 21254.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3004, pruned_loss=0.07806, over 4262110.47 frames. 
], batch size: 549, lr: 4.17e-03, grad_scale: 16.0 2023-06-25 05:53:56,356 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1270002.0, ans=0.125 2023-06-25 05:54:08,340 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1270002.0, ans=0.0 2023-06-25 05:54:58,492 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1270122.0, ans=0.125 2023-06-25 05:55:23,270 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1270182.0, ans=0.0 2023-06-25 05:55:33,650 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1270242.0, ans=0.125 2023-06-25 05:55:43,782 INFO [train.py:996] (0/4) Epoch 7, batch 28750, loss[loss=0.2379, simple_loss=0.3074, pruned_loss=0.08419, over 21301.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3008, pruned_loss=0.07766, over 4261126.62 frames. ], batch size: 143, lr: 4.17e-03, grad_scale: 16.0 2023-06-25 05:55:48,327 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.74 vs. limit=15.0 2023-06-25 05:55:58,832 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.70 vs. limit=22.5 2023-06-25 05:56:06,406 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.645e+02 3.238e+02 3.725e+02 5.020e+02 9.578e+02, threshold=7.449e+02, percent-clipped=2.0 2023-06-25 05:56:41,223 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1270422.0, ans=0.2 2023-06-25 05:57:25,456 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1270542.0, ans=0.025 2023-06-25 05:57:33,211 INFO [train.py:996] (0/4) Epoch 7, batch 28800, loss[loss=0.2595, simple_loss=0.3302, pruned_loss=0.09435, over 21780.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3036, pruned_loss=0.07816, over 4270225.05 frames. ], batch size: 332, lr: 4.17e-03, grad_scale: 32.0 2023-06-25 05:57:44,467 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.73 vs. limit=10.0 2023-06-25 05:57:49,332 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1270662.0, ans=0.2 2023-06-25 05:58:27,085 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1270722.0, ans=0.0 2023-06-25 05:59:06,841 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1270842.0, ans=0.125 2023-06-25 05:59:22,084 INFO [train.py:996] (0/4) Epoch 7, batch 28850, loss[loss=0.2634, simple_loss=0.3667, pruned_loss=0.08011, over 19962.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3049, pruned_loss=0.07962, over 4275034.99 frames. 
], batch size: 702, lr: 4.17e-03, grad_scale: 16.0 2023-06-25 05:59:22,775 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1270902.0, ans=0.0 2023-06-25 06:00:02,819 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.622e+02 3.393e+02 4.119e+02 6.059e+02 1.112e+03, threshold=8.239e+02, percent-clipped=12.0 2023-06-25 06:00:12,839 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1271022.0, ans=0.125 2023-06-25 06:01:17,976 INFO [train.py:996] (0/4) Epoch 7, batch 28900, loss[loss=0.2364, simple_loss=0.2908, pruned_loss=0.09098, over 21614.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3089, pruned_loss=0.08195, over 4278354.53 frames. ], batch size: 548, lr: 4.17e-03, grad_scale: 16.0 2023-06-25 06:01:21,964 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1271202.0, ans=0.2 2023-06-25 06:01:58,916 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.88 vs. limit=15.0 2023-06-25 06:03:09,346 INFO [train.py:996] (0/4) Epoch 7, batch 28950, loss[loss=0.2197, simple_loss=0.3053, pruned_loss=0.06705, over 21764.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3102, pruned_loss=0.0811, over 4272782.43 frames. ], batch size: 332, lr: 4.17e-03, grad_scale: 8.0 2023-06-25 06:03:46,138 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.351e+02 3.609e+02 4.387e+02 5.987e+02 1.071e+03, threshold=8.774e+02, percent-clipped=6.0 2023-06-25 06:03:57,103 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1271622.0, ans=0.125 2023-06-25 06:04:05,718 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1271622.0, ans=0.125 2023-06-25 06:04:09,461 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1271622.0, ans=0.125 2023-06-25 06:04:12,904 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1271682.0, ans=0.125 2023-06-25 06:04:35,783 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1271742.0, ans=0.0 2023-06-25 06:04:37,630 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1271742.0, ans=0.125 2023-06-25 06:05:02,839 INFO [train.py:996] (0/4) Epoch 7, batch 29000, loss[loss=0.2482, simple_loss=0.3219, pruned_loss=0.08728, over 21340.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3116, pruned_loss=0.0791, over 4269333.06 frames. ], batch size: 549, lr: 4.17e-03, grad_scale: 8.0 2023-06-25 06:05:26,150 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.99 vs. limit=6.0 2023-06-25 06:06:04,608 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-212000.pt 2023-06-25 06:06:51,575 INFO [train.py:996] (0/4) Epoch 7, batch 29050, loss[loss=0.2048, simple_loss=0.2809, pruned_loss=0.06437, over 21662.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3098, pruned_loss=0.07942, over 4272011.03 frames. 
], batch size: 263, lr: 4.17e-03, grad_scale: 8.0 2023-06-25 06:07:18,772 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1272162.0, ans=0.0 2023-06-25 06:07:21,612 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.493e+02 3.635e+02 4.186e+02 5.307e+02 1.029e+03, threshold=8.372e+02, percent-clipped=1.0 2023-06-25 06:08:33,081 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1272342.0, ans=0.125 2023-06-25 06:08:37,189 INFO [train.py:996] (0/4) Epoch 7, batch 29100, loss[loss=0.1787, simple_loss=0.245, pruned_loss=0.05617, over 21620.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3011, pruned_loss=0.07733, over 4281847.13 frames. ], batch size: 298, lr: 4.17e-03, grad_scale: 8.0 2023-06-25 06:08:53,018 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1272402.0, ans=0.125 2023-06-25 06:09:13,232 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1272462.0, ans=0.125 2023-06-25 06:09:30,552 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1272522.0, ans=0.0 2023-06-25 06:10:19,655 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.69 vs. limit=22.5 2023-06-25 06:10:23,689 INFO [train.py:996] (0/4) Epoch 7, batch 29150, loss[loss=0.2116, simple_loss=0.2923, pruned_loss=0.0654, over 21675.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.2984, pruned_loss=0.0753, over 4285213.14 frames. ], batch size: 247, lr: 4.17e-03, grad_scale: 8.0 2023-06-25 06:10:34,600 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1272702.0, ans=0.0 2023-06-25 06:10:45,095 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.79 vs. limit=22.5 2023-06-25 06:10:54,198 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.266e+02 3.210e+02 4.222e+02 5.476e+02 9.873e+02, threshold=8.444e+02, percent-clipped=1.0 2023-06-25 06:10:56,288 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1272762.0, ans=0.125 2023-06-25 06:11:11,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1272822.0, ans=0.125 2023-06-25 06:11:47,927 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.50 vs. limit=15.0 2023-06-25 06:12:10,653 INFO [train.py:996] (0/4) Epoch 7, batch 29200, loss[loss=0.2116, simple_loss=0.2682, pruned_loss=0.07755, over 20063.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2931, pruned_loss=0.07408, over 4274524.26 frames. ], batch size: 702, lr: 4.17e-03, grad_scale: 16.0 2023-06-25 06:12:49,301 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.88 vs. 
limit=15.0 2023-06-25 06:13:09,569 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1273122.0, ans=0.125 2023-06-25 06:14:05,292 INFO [train.py:996] (0/4) Epoch 7, batch 29250, loss[loss=0.1939, simple_loss=0.2752, pruned_loss=0.05626, over 21232.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.291, pruned_loss=0.07208, over 4271689.08 frames. ], batch size: 176, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:14:12,798 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1273302.0, ans=0.125 2023-06-25 06:14:21,466 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1273362.0, ans=0.5 2023-06-25 06:14:31,559 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.408e+02 3.162e+02 4.067e+02 5.479e+02 1.081e+03, threshold=8.134e+02, percent-clipped=3.0 2023-06-25 06:14:52,152 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1273422.0, ans=0.0 2023-06-25 06:14:55,162 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1273422.0, ans=0.0 2023-06-25 06:14:57,270 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1273422.0, ans=0.0 2023-06-25 06:15:48,997 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1273542.0, ans=0.1 2023-06-25 06:15:53,734 INFO [train.py:996] (0/4) Epoch 7, batch 29300, loss[loss=0.1912, simple_loss=0.2683, pruned_loss=0.05708, over 19753.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.293, pruned_loss=0.07132, over 4274256.32 frames. ], batch size: 703, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:16:08,002 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1273602.0, ans=0.2 2023-06-25 06:16:31,262 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.75 vs. limit=15.0 2023-06-25 06:16:32,625 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1273722.0, ans=0.125 2023-06-25 06:16:43,960 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.75 vs. limit=15.0 2023-06-25 06:17:36,491 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.72 vs. limit=6.0 2023-06-25 06:17:42,115 INFO [train.py:996] (0/4) Epoch 7, batch 29350, loss[loss=0.1969, simple_loss=0.2695, pruned_loss=0.06217, over 21365.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2897, pruned_loss=0.07118, over 4269810.56 frames. 
], batch size: 131, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:17:55,161 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1273902.0, ans=0.125 2023-06-25 06:18:13,791 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.302e+02 3.026e+02 3.822e+02 5.352e+02 1.093e+03, threshold=7.644e+02, percent-clipped=3.0 2023-06-25 06:18:18,347 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.70 vs. limit=10.0 2023-06-25 06:19:02,654 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.16 vs. limit=6.0 2023-06-25 06:19:30,141 INFO [train.py:996] (0/4) Epoch 7, batch 29400, loss[loss=0.1983, simple_loss=0.2766, pruned_loss=0.06005, over 21754.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2893, pruned_loss=0.06845, over 4266950.45 frames. ], batch size: 352, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:20:07,130 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.74 vs. limit=22.5 2023-06-25 06:20:20,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1274322.0, ans=0.125 2023-06-25 06:20:24,351 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1274322.0, ans=0.0 2023-06-25 06:20:24,376 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1274322.0, ans=0.125 2023-06-25 06:20:33,055 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1274322.0, ans=0.125 2023-06-25 06:20:46,904 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1274382.0, ans=0.125 2023-06-25 06:21:00,091 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.45 vs. limit=10.0 2023-06-25 06:21:08,068 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1274442.0, ans=0.2 2023-06-25 06:21:20,155 INFO [train.py:996] (0/4) Epoch 7, batch 29450, loss[loss=0.2482, simple_loss=0.3225, pruned_loss=0.08693, over 21739.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2868, pruned_loss=0.06727, over 4273279.45 frames. ], batch size: 332, lr: 4.16e-03, grad_scale: 8.0 2023-06-25 06:21:53,723 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.339e+02 3.532e+02 4.385e+02 5.559e+02 1.410e+03, threshold=8.770e+02, percent-clipped=9.0 2023-06-25 06:22:05,875 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.37 vs. limit=15.0 2023-06-25 06:22:10,147 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1274622.0, ans=0.125 2023-06-25 06:22:24,411 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.42 vs. 
limit=15.0 2023-06-25 06:23:08,505 INFO [train.py:996] (0/4) Epoch 7, batch 29500, loss[loss=0.215, simple_loss=0.2809, pruned_loss=0.07457, over 21568.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2929, pruned_loss=0.07108, over 4272781.04 frames. ], batch size: 548, lr: 4.16e-03, grad_scale: 8.0 2023-06-25 06:23:45,088 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1274862.0, ans=0.125 2023-06-25 06:24:12,403 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1274922.0, ans=0.0 2023-06-25 06:24:28,629 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.10 vs. limit=10.0 2023-06-25 06:24:34,527 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.14 vs. limit=22.5 2023-06-25 06:24:56,248 INFO [train.py:996] (0/4) Epoch 7, batch 29550, loss[loss=0.226, simple_loss=0.293, pruned_loss=0.07953, over 21338.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2935, pruned_loss=0.07301, over 4278256.46 frames. ], batch size: 159, lr: 4.16e-03, grad_scale: 8.0 2023-06-25 06:25:13,290 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1275102.0, ans=0.2 2023-06-25 06:25:30,061 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.666e+02 3.932e+02 4.748e+02 5.685e+02 9.373e+02, threshold=9.495e+02, percent-clipped=3.0 2023-06-25 06:25:42,421 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1275222.0, ans=0.125 2023-06-25 06:26:11,500 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1275282.0, ans=0.0 2023-06-25 06:26:45,592 INFO [train.py:996] (0/4) Epoch 7, batch 29600, loss[loss=0.2569, simple_loss=0.3457, pruned_loss=0.08404, over 21766.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.2999, pruned_loss=0.07547, over 4286514.86 frames. ], batch size: 332, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:26:53,843 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1275402.0, ans=0.125 2023-06-25 06:27:04,970 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1275402.0, ans=0.1 2023-06-25 06:28:22,070 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1275642.0, ans=0.125 2023-06-25 06:28:33,300 INFO [train.py:996] (0/4) Epoch 7, batch 29650, loss[loss=0.1761, simple_loss=0.2514, pruned_loss=0.05037, over 21794.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2989, pruned_loss=0.07282, over 4288512.93 frames. 
], batch size: 282, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:29:16,846 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.311e+02 3.458e+02 4.326e+02 5.325e+02 1.074e+03, threshold=8.651e+02, percent-clipped=3.0 2023-06-25 06:30:04,879 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1275942.0, ans=0.125 2023-06-25 06:30:17,443 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1275942.0, ans=0.0 2023-06-25 06:30:27,052 INFO [train.py:996] (0/4) Epoch 7, batch 29700, loss[loss=0.315, simple_loss=0.4095, pruned_loss=0.1102, over 21537.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2992, pruned_loss=0.07243, over 4293143.27 frames. ], batch size: 471, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:30:41,985 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1276002.0, ans=0.05 2023-06-25 06:31:57,643 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1276242.0, ans=0.125 2023-06-25 06:32:16,271 INFO [train.py:996] (0/4) Epoch 7, batch 29750, loss[loss=0.2764, simple_loss=0.3589, pruned_loss=0.09697, over 21537.00 frames. ], tot_loss[loss=0.224, simple_loss=0.3035, pruned_loss=0.07223, over 4281547.93 frames. ], batch size: 507, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:32:54,081 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.514e+02 3.299e+02 3.896e+02 4.722e+02 1.232e+03, threshold=7.792e+02, percent-clipped=5.0 2023-06-25 06:32:58,693 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.95 vs. limit=15.0 2023-06-25 06:33:08,649 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 06:33:08,675 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1276422.0, ans=0.07 2023-06-25 06:33:09,084 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.97 vs. limit=22.5 2023-06-25 06:33:21,351 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.31 vs. limit=15.0 2023-06-25 06:33:45,155 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.07 vs. limit=22.5 2023-06-25 06:34:03,387 INFO [train.py:996] (0/4) Epoch 7, batch 29800, loss[loss=0.2157, simple_loss=0.2911, pruned_loss=0.07013, over 21883.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.3042, pruned_loss=0.07261, over 4275183.84 frames. ], batch size: 351, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:34:16,708 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.72 vs. 
limit=15.0 2023-06-25 06:35:12,529 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1276782.0, ans=0.125 2023-06-25 06:35:32,733 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1276842.0, ans=0.125 2023-06-25 06:35:50,571 INFO [train.py:996] (0/4) Epoch 7, batch 29850, loss[loss=0.1953, simple_loss=0.2698, pruned_loss=0.0604, over 21559.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2996, pruned_loss=0.07048, over 4283877.02 frames. ], batch size: 212, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:36:28,324 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.285e+02 2.948e+02 3.373e+02 4.045e+02 7.832e+02, threshold=6.745e+02, percent-clipped=1.0 2023-06-25 06:36:31,054 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.42 vs. limit=22.5 2023-06-25 06:36:31,070 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=6.81 vs. limit=22.5 2023-06-25 06:36:31,234 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.43 vs. limit=15.0 2023-06-25 06:37:06,509 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1277082.0, ans=0.125 2023-06-25 06:37:21,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1277142.0, ans=0.0 2023-06-25 06:37:36,840 INFO [train.py:996] (0/4) Epoch 7, batch 29900, loss[loss=0.2228, simple_loss=0.2973, pruned_loss=0.07414, over 21654.00 frames. ], tot_loss[loss=0.221, simple_loss=0.299, pruned_loss=0.07153, over 4284713.10 frames. ], batch size: 230, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:37:53,316 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1277202.0, ans=0.0 2023-06-25 06:37:59,797 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1277202.0, ans=0.0 2023-06-25 06:38:01,674 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1277202.0, ans=0.015 2023-06-25 06:38:26,143 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1277322.0, ans=0.0 2023-06-25 06:39:23,190 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.32 vs. limit=15.0 2023-06-25 06:39:25,226 INFO [train.py:996] (0/4) Epoch 7, batch 29950, loss[loss=0.2444, simple_loss=0.3127, pruned_loss=0.08802, over 21422.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3038, pruned_loss=0.07508, over 4280537.09 frames. 
], batch size: 549, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:39:32,774 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1277502.0, ans=0.0 2023-06-25 06:39:41,335 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1277502.0, ans=0.1 2023-06-25 06:39:42,040 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn1.whiten.whitening_limit, batch_count=1277502.0, ans=22.5 2023-06-25 06:40:08,694 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.749e+02 3.319e+02 4.450e+02 5.387e+02 9.920e+02, threshold=8.899e+02, percent-clipped=12.0 2023-06-25 06:40:21,603 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1277622.0, ans=0.125 2023-06-25 06:40:21,615 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1277622.0, ans=0.5 2023-06-25 06:41:11,878 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1277742.0, ans=0.125 2023-06-25 06:41:15,602 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.45 vs. limit=6.0 2023-06-25 06:41:19,266 INFO [train.py:996] (0/4) Epoch 7, batch 30000, loss[loss=0.2001, simple_loss=0.2994, pruned_loss=0.05039, over 21705.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3055, pruned_loss=0.0756, over 4280561.73 frames. ], batch size: 298, lr: 4.16e-03, grad_scale: 32.0 2023-06-25 06:41:19,267 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-25 06:41:39,216 INFO [train.py:1028] (0/4) Epoch 7, validation: loss=0.2493, simple_loss=0.346, pruned_loss=0.07628, over 1796401.00 frames. 2023-06-25 06:41:39,217 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23714MB 2023-06-25 06:41:51,583 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1277802.0, ans=0.125 2023-06-25 06:41:53,167 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1277802.0, ans=0.0 2023-06-25 06:43:20,599 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=19.11 vs. limit=22.5 2023-06-25 06:43:30,358 INFO [train.py:996] (0/4) Epoch 7, batch 30050, loss[loss=0.256, simple_loss=0.3608, pruned_loss=0.07567, over 21861.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.309, pruned_loss=0.07318, over 4273472.42 frames. ], batch size: 372, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:43:34,452 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1278102.0, ans=0.1 2023-06-25 06:44:01,384 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.44 vs. 
limit=15.0 2023-06-25 06:44:03,087 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1278162.0, ans=0.125 2023-06-25 06:44:04,716 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1278162.0, ans=0.125 2023-06-25 06:44:05,683 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.420e+02 3.279e+02 4.155e+02 5.724e+02 1.149e+03, threshold=8.309e+02, percent-clipped=6.0 2023-06-25 06:44:08,461 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1278162.0, ans=0.0 2023-06-25 06:44:46,082 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1278282.0, ans=0.0 2023-06-25 06:44:49,379 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1278282.0, ans=0.125 2023-06-25 06:44:52,453 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1278282.0, ans=0.2 2023-06-25 06:45:01,171 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 06:45:04,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1278342.0, ans=0.0 2023-06-25 06:45:17,744 INFO [train.py:996] (0/4) Epoch 7, batch 30100, loss[loss=0.1893, simple_loss=0.2394, pruned_loss=0.06958, over 19979.00 frames. ], tot_loss[loss=0.226, simple_loss=0.307, pruned_loss=0.07247, over 4261328.29 frames. ], batch size: 702, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:45:42,545 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 06:45:47,983 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1278462.0, ans=0.1 2023-06-25 06:45:48,593 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.whiten.whitening_limit, batch_count=1278462.0, ans=12.0 2023-06-25 06:46:12,326 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1278522.0, ans=0.0 2023-06-25 06:46:35,988 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1278582.0, ans=0.035 2023-06-25 06:46:53,805 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1278642.0, ans=0.125 2023-06-25 06:46:55,548 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1278642.0, ans=0.1 2023-06-25 06:47:10,893 INFO [train.py:996] (0/4) Epoch 7, batch 30150, loss[loss=0.2376, simple_loss=0.308, pruned_loss=0.08361, over 21362.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.3022, pruned_loss=0.07379, over 4265152.87 frames. ], batch size: 176, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:47:20,889 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1278702.0, ans=0.5 2023-06-25 06:47:21,388 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.43 vs. 
limit=22.5 2023-06-25 06:47:47,381 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.588e+02 3.267e+02 3.809e+02 4.984e+02 9.103e+02, threshold=7.618e+02, percent-clipped=3.0 2023-06-25 06:48:24,572 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.76 vs. limit=22.5 2023-06-25 06:48:45,391 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1278942.0, ans=0.125 2023-06-25 06:48:56,777 INFO [train.py:996] (0/4) Epoch 7, batch 30200, loss[loss=0.2239, simple_loss=0.3204, pruned_loss=0.06374, over 21599.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.3057, pruned_loss=0.07291, over 4269360.68 frames. ], batch size: 414, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:49:58,777 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1279122.0, ans=0.1 2023-06-25 06:50:59,078 INFO [train.py:996] (0/4) Epoch 7, batch 30250, loss[loss=0.2462, simple_loss=0.3475, pruned_loss=0.07245, over 21469.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3111, pruned_loss=0.07486, over 4263993.53 frames. ], batch size: 211, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:51:11,033 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1279302.0, ans=0.0 2023-06-25 06:51:33,135 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.437e+02 3.334e+02 4.601e+02 6.960e+02 1.343e+03, threshold=9.203e+02, percent-clipped=16.0 2023-06-25 06:52:41,300 INFO [train.py:996] (0/4) Epoch 7, batch 30300, loss[loss=0.181, simple_loss=0.2392, pruned_loss=0.0614, over 20698.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.308, pruned_loss=0.07461, over 4260280.05 frames. ], batch size: 607, lr: 4.15e-03, grad_scale: 16.0 2023-06-25 06:53:17,723 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1279662.0, ans=0.125 2023-06-25 06:53:33,533 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1279722.0, ans=0.0 2023-06-25 06:53:50,140 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1279782.0, ans=0.0 2023-06-25 06:54:37,800 INFO [train.py:996] (0/4) Epoch 7, batch 30350, loss[loss=0.2409, simple_loss=0.3405, pruned_loss=0.0707, over 20778.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3086, pruned_loss=0.07576, over 4265736.19 frames. 
], batch size: 607, lr: 4.15e-03, grad_scale: 16.0 2023-06-25 06:54:52,110 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1279902.0, ans=0.125 2023-06-25 06:54:59,693 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1279962.0, ans=0.2 2023-06-25 06:55:05,064 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.424e+02 3.772e+02 4.635e+02 6.721e+02 1.384e+03, threshold=9.269e+02, percent-clipped=9.0 2023-06-25 06:55:21,941 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1280022.0, ans=0.0 2023-06-25 06:56:00,348 INFO [train.py:996] (0/4) Epoch 7, batch 30400, loss[loss=0.2064, simple_loss=0.2585, pruned_loss=0.07715, over 20256.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3024, pruned_loss=0.07464, over 4250669.36 frames. ], batch size: 703, lr: 4.15e-03, grad_scale: 32.0 2023-06-25 06:57:33,142 INFO [train.py:996] (0/4) Epoch 7, batch 30450, loss[loss=0.2541, simple_loss=0.354, pruned_loss=0.07712, over 19917.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.3023, pruned_loss=0.0742, over 4193773.52 frames. ], batch size: 702, lr: 4.15e-03, grad_scale: 32.0 2023-06-25 06:57:39,248 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1280502.0, ans=0.025 2023-06-25 06:57:56,752 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1280562.0, ans=0.0 2023-06-25 06:58:02,612 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.728e+02 6.501e+02 9.013e+02 1.486e+03 3.895e+03, threshold=1.803e+03, percent-clipped=46.0 2023-06-25 06:58:45,748 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/epoch-7.pt 2023-06-25 07:01:02,056 INFO [train.py:996] (0/4) Epoch 8, batch 0, loss[loss=0.2229, simple_loss=0.2915, pruned_loss=0.07717, over 21658.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.2915, pruned_loss=0.07717, over 21658.00 frames. ], batch size: 333, lr: 3.86e-03, grad_scale: 32.0 2023-06-25 07:01:02,057 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-25 07:01:19,566 INFO [train.py:1028] (0/4) Epoch 8, validation: loss=0.2406, simple_loss=0.3467, pruned_loss=0.06724, over 1796401.00 frames. 2023-06-25 07:01:19,567 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23714MB 2023-06-25 07:03:05,853 INFO [train.py:996] (0/4) Epoch 8, batch 50, loss[loss=0.252, simple_loss=0.3372, pruned_loss=0.08339, over 21771.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.2995, pruned_loss=0.07142, over 952497.73 frames. ], batch size: 351, lr: 3.86e-03, grad_scale: 32.0 2023-06-25 07:03:31,472 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1281132.0, ans=0.0 2023-06-25 07:03:44,207 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.93 vs. limit=6.0 2023-06-25 07:03:49,765 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.736e+02 3.478e+02 5.204e+02 1.094e+03 2.896e+03, threshold=1.041e+03, percent-clipped=7.0 2023-06-25 07:04:51,363 INFO [train.py:996] (0/4) Epoch 8, batch 100, loss[loss=0.2495, simple_loss=0.3369, pruned_loss=0.08109, over 21348.00 frames. 
], tot_loss[loss=0.2337, simple_loss=0.3169, pruned_loss=0.07524, over 1685779.59 frames. ], batch size: 159, lr: 3.86e-03, grad_scale: 16.0 2023-06-25 07:05:19,074 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.69 vs. limit=22.5 2023-06-25 07:06:01,668 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1281552.0, ans=0.2 2023-06-25 07:06:37,755 INFO [train.py:996] (0/4) Epoch 8, batch 150, loss[loss=0.2626, simple_loss=0.3423, pruned_loss=0.09143, over 21747.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3227, pruned_loss=0.07623, over 2243688.28 frames. ], batch size: 441, lr: 3.86e-03, grad_scale: 16.0 2023-06-25 07:07:12,449 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1281732.0, ans=0.125 2023-06-25 07:07:27,458 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.475e+02 3.041e+02 3.436e+02 4.359e+02 9.068e+02, threshold=6.872e+02, percent-clipped=0.0 2023-06-25 07:07:51,224 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1281852.0, ans=0.125 2023-06-25 07:08:07,661 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1281912.0, ans=0.125 2023-06-25 07:08:18,588 INFO [train.py:996] (0/4) Epoch 8, batch 200, loss[loss=0.2019, simple_loss=0.2794, pruned_loss=0.06215, over 21156.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3221, pruned_loss=0.07585, over 2689140.07 frames. ], batch size: 159, lr: 3.86e-03, grad_scale: 16.0 2023-06-25 07:09:59,999 INFO [train.py:996] (0/4) Epoch 8, batch 250, loss[loss=0.2214, simple_loss=0.2887, pruned_loss=0.07708, over 21938.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3168, pruned_loss=0.07562, over 3043157.09 frames. ], batch size: 316, lr: 3.86e-03, grad_scale: 16.0 2023-06-25 07:10:18,028 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1282332.0, ans=0.1 2023-06-25 07:10:18,809 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.10 vs. limit=15.0 2023-06-25 07:10:45,288 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.611e+02 3.498e+02 4.445e+02 5.647e+02 1.101e+03, threshold=8.891e+02, percent-clipped=14.0 2023-06-25 07:11:22,177 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.05 vs. limit=15.0 2023-06-25 07:11:27,048 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.55 vs. limit=15.0 2023-06-25 07:11:46,097 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1282512.0, ans=0.0 2023-06-25 07:11:49,074 INFO [train.py:996] (0/4) Epoch 8, batch 300, loss[loss=0.2331, simple_loss=0.3268, pruned_loss=0.06973, over 21663.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3109, pruned_loss=0.0748, over 3321391.81 frames. 
], batch size: 389, lr: 3.86e-03, grad_scale: 16.0 2023-06-25 07:11:58,876 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1282572.0, ans=0.125 2023-06-25 07:12:14,737 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1282632.0, ans=10.0 2023-06-25 07:12:32,911 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1282692.0, ans=0.0 2023-06-25 07:13:39,781 INFO [train.py:996] (0/4) Epoch 8, batch 350, loss[loss=0.1886, simple_loss=0.2548, pruned_loss=0.06122, over 21647.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.3035, pruned_loss=0.07308, over 3532193.29 frames. ], batch size: 282, lr: 3.86e-03, grad_scale: 16.0 2023-06-25 07:13:46,169 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1282872.0, ans=0.125 2023-06-25 07:14:00,473 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.62 vs. limit=6.0 2023-06-25 07:14:13,675 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=1282932.0, ans=10.0 2023-06-25 07:14:30,127 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.317e+02 3.131e+02 3.897e+02 5.934e+02 1.239e+03, threshold=7.794e+02, percent-clipped=5.0 2023-06-25 07:15:06,563 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=10.38 vs. limit=12.0 2023-06-25 07:15:24,679 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1283112.0, ans=0.125 2023-06-25 07:15:27,532 INFO [train.py:996] (0/4) Epoch 8, batch 400, loss[loss=0.1978, simple_loss=0.2654, pruned_loss=0.06508, over 21830.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.2979, pruned_loss=0.07245, over 3706450.88 frames. ], batch size: 352, lr: 3.86e-03, grad_scale: 32.0 2023-06-25 07:15:31,453 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1283172.0, ans=0.125 2023-06-25 07:15:45,645 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1283232.0, ans=0.0 2023-06-25 07:16:45,075 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1283352.0, ans=0.125 2023-06-25 07:16:56,607 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1283352.0, ans=0.125 2023-06-25 07:17:16,283 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1283412.0, ans=0.125 2023-06-25 07:17:19,195 INFO [train.py:996] (0/4) Epoch 8, batch 450, loss[loss=0.2266, simple_loss=0.3201, pruned_loss=0.06656, over 21501.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2942, pruned_loss=0.07068, over 3837998.94 frames. 
], batch size: 471, lr: 3.86e-03, grad_scale: 32.0 2023-06-25 07:17:22,004 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1283472.0, ans=0.0 2023-06-25 07:17:23,718 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1283472.0, ans=0.09899494936611666 2023-06-25 07:17:29,165 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.96 vs. limit=15.0 2023-06-25 07:17:43,127 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.18 vs. limit=15.0 2023-06-25 07:18:16,643 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.309e+02 3.535e+02 4.359e+02 5.649e+02 1.208e+03, threshold=8.718e+02, percent-clipped=9.0 2023-06-25 07:19:01,980 INFO [train.py:996] (0/4) Epoch 8, batch 500, loss[loss=0.2325, simple_loss=0.3086, pruned_loss=0.07819, over 21247.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2949, pruned_loss=0.07022, over 3939318.50 frames. ], batch size: 159, lr: 3.86e-03, grad_scale: 16.0 2023-06-25 07:19:04,503 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1283772.0, ans=10.0 2023-06-25 07:19:43,796 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1283832.0, ans=0.125 2023-06-25 07:20:29,589 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1283952.0, ans=0.2 2023-06-25 07:20:49,174 INFO [train.py:996] (0/4) Epoch 8, batch 550, loss[loss=0.3521, simple_loss=0.4349, pruned_loss=0.1347, over 21468.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2983, pruned_loss=0.06965, over 4020796.25 frames. ], batch size: 507, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:21:08,499 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1284072.0, ans=0.2 2023-06-25 07:21:13,736 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1284132.0, ans=0.1 2023-06-25 07:21:45,844 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.415e+02 3.578e+02 5.101e+02 7.574e+02 1.639e+03, threshold=1.020e+03, percent-clipped=17.0 2023-06-25 07:21:46,310 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1284192.0, ans=0.125 2023-06-25 07:21:52,929 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1284192.0, ans=0.0 2023-06-25 07:21:56,199 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1284252.0, ans=0.2 2023-06-25 07:22:28,849 INFO [train.py:996] (0/4) Epoch 8, batch 600, loss[loss=0.2939, simple_loss=0.3902, pruned_loss=0.09884, over 21606.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.3016, pruned_loss=0.0705, over 4083987.82 frames. ], batch size: 441, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:22:55,799 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.31 vs. 
limit=15.0 2023-06-25 07:24:05,393 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1284612.0, ans=0.125 2023-06-25 07:24:14,804 INFO [train.py:996] (0/4) Epoch 8, batch 650, loss[loss=0.2508, simple_loss=0.3725, pruned_loss=0.06456, over 20803.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.303, pruned_loss=0.07064, over 4131060.97 frames. ], batch size: 607, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:24:36,281 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1284732.0, ans=0.0 2023-06-25 07:24:46,603 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1284732.0, ans=0.125 2023-06-25 07:24:58,859 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1284792.0, ans=0.07 2023-06-25 07:25:16,311 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.375e+02 3.313e+02 4.571e+02 7.176e+02 1.629e+03, threshold=9.143e+02, percent-clipped=10.0 2023-06-25 07:25:17,021 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1284792.0, ans=0.125 2023-06-25 07:25:38,764 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1284852.0, ans=0.2 2023-06-25 07:26:00,077 INFO [train.py:996] (0/4) Epoch 8, batch 700, loss[loss=0.2361, simple_loss=0.3101, pruned_loss=0.08106, over 21890.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.305, pruned_loss=0.07129, over 4167525.14 frames. ], batch size: 118, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:26:04,266 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1284972.0, ans=0.1 2023-06-25 07:27:22,511 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1285152.0, ans=0.1 2023-06-25 07:27:44,317 INFO [train.py:996] (0/4) Epoch 8, batch 750, loss[loss=0.1986, simple_loss=0.265, pruned_loss=0.06611, over 15453.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.3021, pruned_loss=0.07136, over 4188386.64 frames. ], batch size: 61, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:28:07,570 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1285332.0, ans=0.125 2023-06-25 07:28:10,851 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1285332.0, ans=0.125 2023-06-25 07:28:39,394 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1285392.0, ans=0.125 2023-06-25 07:28:47,157 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.536e+02 3.601e+02 4.438e+02 5.764e+02 1.140e+03, threshold=8.877e+02, percent-clipped=3.0 2023-06-25 07:29:32,232 INFO [train.py:996] (0/4) Epoch 8, batch 800, loss[loss=0.2168, simple_loss=0.2858, pruned_loss=0.07386, over 21849.00 frames. ], tot_loss[loss=0.221, simple_loss=0.2991, pruned_loss=0.07149, over 4214222.29 frames. 
], batch size: 107, lr: 3.85e-03, grad_scale: 32.0 2023-06-25 07:29:57,866 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.12 vs. limit=15.0 2023-06-25 07:31:25,119 INFO [train.py:996] (0/4) Epoch 8, batch 850, loss[loss=0.2303, simple_loss=0.298, pruned_loss=0.08134, over 21880.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.2982, pruned_loss=0.07201, over 4231722.25 frames. ], batch size: 414, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:31:37,457 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1285872.0, ans=0.125 2023-06-25 07:32:23,993 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.482e+02 3.234e+02 3.833e+02 4.866e+02 9.722e+02, threshold=7.666e+02, percent-clipped=1.0 2023-06-25 07:32:38,127 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1286052.0, ans=0.125 2023-06-25 07:32:43,008 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1286052.0, ans=0.04949747468305833 2023-06-25 07:33:13,045 INFO [train.py:996] (0/4) Epoch 8, batch 900, loss[loss=0.2539, simple_loss=0.2926, pruned_loss=0.1076, over 21453.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2944, pruned_loss=0.0721, over 4245084.77 frames. ], batch size: 508, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:33:28,031 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1286172.0, ans=0.1 2023-06-25 07:34:12,611 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.96 vs. limit=15.0 2023-06-25 07:34:55,168 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1286412.0, ans=0.125 2023-06-25 07:34:56,887 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1286412.0, ans=0.1 2023-06-25 07:35:01,276 INFO [train.py:996] (0/4) Epoch 8, batch 950, loss[loss=0.2125, simple_loss=0.2919, pruned_loss=0.06653, over 21861.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2917, pruned_loss=0.07139, over 4258190.66 frames. ], batch size: 298, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:35:08,630 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1286472.0, ans=0.0 2023-06-25 07:35:54,407 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.907e+02 3.602e+02 4.628e+02 6.707e+02 1.446e+03, threshold=9.256e+02, percent-clipped=20.0 2023-06-25 07:36:01,949 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 07:36:42,697 INFO [train.py:996] (0/4) Epoch 8, batch 1000, loss[loss=0.2093, simple_loss=0.2908, pruned_loss=0.06388, over 21678.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2937, pruned_loss=0.07209, over 4268252.31 frames. 
], batch size: 389, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:36:47,077 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1286772.0, ans=0.125 2023-06-25 07:38:04,465 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1286952.0, ans=0.2 2023-06-25 07:38:17,386 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.24 vs. limit=22.5 2023-06-25 07:38:31,311 INFO [train.py:996] (0/4) Epoch 8, batch 1050, loss[loss=0.1976, simple_loss=0.2762, pruned_loss=0.0595, over 21762.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2925, pruned_loss=0.07088, over 4265878.75 frames. ], batch size: 124, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:39:30,742 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.534e+02 3.439e+02 4.407e+02 5.715e+02 1.308e+03, threshold=8.815e+02, percent-clipped=4.0 2023-06-25 07:40:06,943 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.15 vs. limit=22.5 2023-06-25 07:40:08,080 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1287312.0, ans=0.125 2023-06-25 07:40:19,291 INFO [train.py:996] (0/4) Epoch 8, batch 1100, loss[loss=0.1939, simple_loss=0.286, pruned_loss=0.05093, over 21721.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2921, pruned_loss=0.0707, over 4271892.91 frames. ], batch size: 351, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:41:04,059 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1287492.0, ans=0.2 2023-06-25 07:41:11,281 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1287492.0, ans=0.05 2023-06-25 07:41:23,446 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1287552.0, ans=0.125 2023-06-25 07:42:15,444 INFO [train.py:996] (0/4) Epoch 8, batch 1150, loss[loss=0.2143, simple_loss=0.2832, pruned_loss=0.07267, over 21836.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2924, pruned_loss=0.07079, over 4269174.15 frames. ], batch size: 107, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:42:45,787 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1287732.0, ans=0.0 2023-06-25 07:42:53,386 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1287792.0, ans=0.0 2023-06-25 07:42:59,618 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.638e+02 3.529e+02 4.325e+02 5.726e+02 1.140e+03, threshold=8.649e+02, percent-clipped=5.0 2023-06-25 07:43:24,970 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1287852.0, ans=0.125 2023-06-25 07:43:50,585 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.88 vs. 
limit=12.0 2023-06-25 07:43:58,468 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1287972.0, ans=0.0 2023-06-25 07:43:59,491 INFO [train.py:996] (0/4) Epoch 8, batch 1200, loss[loss=0.2551, simple_loss=0.3353, pruned_loss=0.08746, over 21583.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2953, pruned_loss=0.07089, over 4272566.08 frames. ], batch size: 471, lr: 3.85e-03, grad_scale: 32.0 2023-06-25 07:44:02,005 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1287972.0, ans=0.0 2023-06-25 07:44:03,719 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1287972.0, ans=0.2 2023-06-25 07:44:11,614 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1287972.0, ans=0.125 2023-06-25 07:44:14,898 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1288032.0, ans=0.125 2023-06-25 07:44:41,936 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1288092.0, ans=0.125 2023-06-25 07:44:51,036 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1288092.0, ans=0.125 2023-06-25 07:44:52,878 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1288092.0, ans=0.07 2023-06-25 07:45:14,045 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1288212.0, ans=0.125 2023-06-25 07:45:14,046 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1288212.0, ans=0.125 2023-06-25 07:45:47,967 INFO [train.py:996] (0/4) Epoch 8, batch 1250, loss[loss=0.1928, simple_loss=0.2302, pruned_loss=0.07772, over 20120.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2955, pruned_loss=0.07049, over 4273255.14 frames. ], batch size: 703, lr: 3.85e-03, grad_scale: 32.0 2023-06-25 07:46:01,065 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1288272.0, ans=0.125 2023-06-25 07:46:06,693 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 07:46:11,869 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1288332.0, ans=0.0 2023-06-25 07:46:12,405 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.81 vs. 
limit=10.0 2023-06-25 07:46:20,603 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1288332.0, ans=0.1 2023-06-25 07:46:24,051 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1288392.0, ans=0.0 2023-06-25 07:46:27,741 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1288392.0, ans=0.125 2023-06-25 07:46:38,021 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.488e+02 3.316e+02 4.127e+02 5.335e+02 1.234e+03, threshold=8.255e+02, percent-clipped=5.0 2023-06-25 07:47:36,810 INFO [train.py:996] (0/4) Epoch 8, batch 1300, loss[loss=0.241, simple_loss=0.3134, pruned_loss=0.08434, over 21802.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2981, pruned_loss=0.07105, over 4272921.43 frames. ], batch size: 441, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:47:58,215 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1288632.0, ans=0.1 2023-06-25 07:47:59,919 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1288632.0, ans=0.125 2023-06-25 07:48:05,826 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.31 vs. limit=10.0 2023-06-25 07:48:41,712 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1288752.0, ans=0.125 2023-06-25 07:49:21,406 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1288812.0, ans=0.0 2023-06-25 07:49:25,889 INFO [train.py:996] (0/4) Epoch 8, batch 1350, loss[loss=0.2772, simple_loss=0.3413, pruned_loss=0.1065, over 21406.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.2996, pruned_loss=0.07169, over 4281267.33 frames. ], batch size: 509, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:50:15,511 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.606e+02 3.456e+02 4.378e+02 5.897e+02 1.151e+03, threshold=8.757e+02, percent-clipped=2.0 2023-06-25 07:50:17,887 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1288992.0, ans=0.1 2023-06-25 07:50:29,800 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1289052.0, ans=0.125 2023-06-25 07:50:44,436 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.86 vs. limit=15.0 2023-06-25 07:50:53,475 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1289112.0, ans=0.0 2023-06-25 07:51:08,365 INFO [train.py:996] (0/4) Epoch 8, batch 1400, loss[loss=0.193, simple_loss=0.2638, pruned_loss=0.0611, over 21717.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2972, pruned_loss=0.07164, over 4282863.56 frames. 
], batch size: 332, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:51:10,805 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1289172.0, ans=0.125 2023-06-25 07:51:31,154 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1289232.0, ans=0.025 2023-06-25 07:52:07,904 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1289352.0, ans=0.0 2023-06-25 07:52:49,693 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1289412.0, ans=0.1 2023-06-25 07:52:57,305 INFO [train.py:996] (0/4) Epoch 8, batch 1450, loss[loss=0.2405, simple_loss=0.3113, pruned_loss=0.08488, over 21415.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2986, pruned_loss=0.07301, over 4276452.13 frames. ], batch size: 131, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:53:15,694 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1289532.0, ans=0.125 2023-06-25 07:53:48,354 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.564e+02 3.442e+02 4.414e+02 6.258e+02 1.881e+03, threshold=8.827e+02, percent-clipped=13.0 2023-06-25 07:53:51,706 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1289592.0, ans=0.125 2023-06-25 07:54:00,385 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1289652.0, ans=0.125 2023-06-25 07:54:05,626 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1289652.0, ans=0.0 2023-06-25 07:54:39,145 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1289712.0, ans=0.125 2023-06-25 07:54:47,190 INFO [train.py:996] (0/4) Epoch 8, batch 1500, loss[loss=0.2196, simple_loss=0.2904, pruned_loss=0.07438, over 21942.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.3011, pruned_loss=0.07454, over 4283076.54 frames. ], batch size: 118, lr: 3.85e-03, grad_scale: 8.0 2023-06-25 07:54:56,991 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1289772.0, ans=0.0 2023-06-25 07:55:24,575 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1289892.0, ans=0.125 2023-06-25 07:55:44,585 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.34 vs. limit=15.0 2023-06-25 07:55:49,542 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1289952.0, ans=0.1 2023-06-25 07:56:28,601 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1290012.0, ans=0.125 2023-06-25 07:56:40,534 INFO [train.py:996] (0/4) Epoch 8, batch 1550, loss[loss=0.1781, simple_loss=0.2667, pruned_loss=0.04473, over 21580.00 frames. ], tot_loss[loss=0.221, simple_loss=0.2979, pruned_loss=0.07209, over 4286121.21 frames. 
], batch size: 389, lr: 3.85e-03, grad_scale: 8.0 2023-06-25 07:57:35,131 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.389e+02 3.681e+02 5.239e+02 6.621e+02 1.108e+03, threshold=1.048e+03, percent-clipped=5.0 2023-06-25 07:57:35,649 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1290192.0, ans=0.125 2023-06-25 07:58:33,647 INFO [train.py:996] (0/4) Epoch 8, batch 1600, loss[loss=0.2015, simple_loss=0.276, pruned_loss=0.06348, over 21246.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2965, pruned_loss=0.07136, over 4280393.18 frames. ], batch size: 176, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:58:44,993 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1290372.0, ans=0.125 2023-06-25 08:00:12,187 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.15 vs. limit=12.0 2023-06-25 08:00:26,987 INFO [train.py:996] (0/4) Epoch 8, batch 1650, loss[loss=0.2645, simple_loss=0.3442, pruned_loss=0.09235, over 21834.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.297, pruned_loss=0.071, over 4287663.12 frames. ], batch size: 118, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 08:00:44,116 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1290672.0, ans=0.125 2023-06-25 08:01:33,702 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1290792.0, ans=0.0 2023-06-25 08:01:38,153 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.521e+02 3.337e+02 4.261e+02 5.571e+02 1.006e+03, threshold=8.522e+02, percent-clipped=0.0 2023-06-25 08:02:20,395 INFO [train.py:996] (0/4) Epoch 8, batch 1700, loss[loss=0.2127, simple_loss=0.3076, pruned_loss=0.05893, over 21780.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.3023, pruned_loss=0.07278, over 4288150.07 frames. ], batch size: 282, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:03:18,700 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1291092.0, ans=0.125 2023-06-25 08:03:33,117 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1291092.0, ans=0.0 2023-06-25 08:04:20,230 INFO [train.py:996] (0/4) Epoch 8, batch 1750, loss[loss=0.1588, simple_loss=0.2229, pruned_loss=0.04731, over 21387.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2985, pruned_loss=0.06994, over 4284496.32 frames. ], batch size: 131, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:05:07,888 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.02 vs. limit=6.0 2023-06-25 08:05:26,784 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.325e+02 3.271e+02 4.291e+02 6.912e+02 1.295e+03, threshold=8.582e+02, percent-clipped=12.0 2023-06-25 08:05:34,594 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1291452.0, ans=0.2 2023-06-25 08:06:19,512 INFO [train.py:996] (0/4) Epoch 8, batch 1800, loss[loss=0.2083, simple_loss=0.3132, pruned_loss=0.05169, over 21645.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2995, pruned_loss=0.06879, over 4276246.40 frames. 
], batch size: 414, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:06:38,757 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.20 vs. limit=15.0 2023-06-25 08:06:44,879 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1291572.0, ans=0.0 2023-06-25 08:06:50,180 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1291632.0, ans=0.125 2023-06-25 08:07:49,451 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1291812.0, ans=0.125 2023-06-25 08:08:10,395 INFO [train.py:996] (0/4) Epoch 8, batch 1850, loss[loss=0.2405, simple_loss=0.3344, pruned_loss=0.07329, over 21537.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2998, pruned_loss=0.06797, over 4273008.25 frames. ], batch size: 473, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:08:46,783 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1291932.0, ans=0.125 2023-06-25 08:09:08,863 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.511e+02 3.967e+02 5.452e+02 7.986e+02 1.937e+03, threshold=1.090e+03, percent-clipped=22.0 2023-06-25 08:09:22,639 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1292052.0, ans=0.2 2023-06-25 08:09:27,539 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1292052.0, ans=0.125 2023-06-25 08:09:43,911 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1292112.0, ans=0.0 2023-06-25 08:09:52,467 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1292112.0, ans=0.04949747468305833 2023-06-25 08:09:59,341 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1292172.0, ans=0.0 2023-06-25 08:10:05,941 INFO [train.py:996] (0/4) Epoch 8, batch 1900, loss[loss=0.2075, simple_loss=0.2982, pruned_loss=0.05839, over 21763.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2995, pruned_loss=0.06861, over 4277194.52 frames. ], batch size: 298, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:10:23,411 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.90 vs. limit=10.0 2023-06-25 08:10:51,576 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.13 vs. limit=15.0 2023-06-25 08:11:14,999 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1292352.0, ans=0.125 2023-06-25 08:11:15,081 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1292352.0, ans=0.0 2023-06-25 08:11:18,828 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.85 vs. 
limit=22.5 2023-06-25 08:11:59,761 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1292412.0, ans=0.035 2023-06-25 08:12:03,405 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1292472.0, ans=0.1 2023-06-25 08:12:03,417 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1292472.0, ans=0.125 2023-06-25 08:12:04,363 INFO [train.py:996] (0/4) Epoch 8, batch 1950, loss[loss=0.1909, simple_loss=0.288, pruned_loss=0.04693, over 21634.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2973, pruned_loss=0.06972, over 4283129.59 frames. ], batch size: 263, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:12:11,163 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1292472.0, ans=0.0 2023-06-25 08:12:38,736 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1292532.0, ans=0.125 2023-06-25 08:12:42,493 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1292592.0, ans=0.125 2023-06-25 08:13:00,219 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.618e+02 4.190e+02 5.257e+02 7.093e+02 1.583e+03, threshold=1.051e+03, percent-clipped=6.0 2023-06-25 08:13:04,848 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.97 vs. limit=6.0 2023-06-25 08:13:16,775 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1292652.0, ans=0.5 2023-06-25 08:13:52,825 INFO [train.py:996] (0/4) Epoch 8, batch 2000, loss[loss=0.1405, simple_loss=0.2123, pruned_loss=0.03434, over 21282.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2927, pruned_loss=0.06729, over 4285784.49 frames. ], batch size: 131, lr: 3.84e-03, grad_scale: 32.0 2023-06-25 08:13:57,383 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.23 vs. limit=22.5 2023-06-25 08:14:04,289 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.60 vs. limit=15.0 2023-06-25 08:14:12,269 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1292832.0, ans=0.0 2023-06-25 08:15:19,333 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1293012.0, ans=0.125 2023-06-25 08:15:44,220 INFO [train.py:996] (0/4) Epoch 8, batch 2050, loss[loss=0.2123, simple_loss=0.2848, pruned_loss=0.06991, over 21436.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2953, pruned_loss=0.06707, over 4281073.09 frames. ], batch size: 144, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:15:57,241 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1293072.0, ans=0.125 2023-06-25 08:16:20,440 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.07 vs. 
limit=6.0 2023-06-25 08:16:39,103 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.662e+02 4.169e+02 5.197e+02 7.491e+02 1.738e+03, threshold=1.039e+03, percent-clipped=10.0 2023-06-25 08:17:35,783 INFO [train.py:996] (0/4) Epoch 8, batch 2100, loss[loss=0.2772, simple_loss=0.3231, pruned_loss=0.1157, over 21389.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2994, pruned_loss=0.06979, over 4276322.24 frames. ], batch size: 507, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:18:02,762 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1293432.0, ans=0.0 2023-06-25 08:18:07,815 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1293432.0, ans=0.0 2023-06-25 08:18:27,561 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1293492.0, ans=0.0 2023-06-25 08:18:39,922 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1293552.0, ans=0.1 2023-06-25 08:19:27,057 INFO [train.py:996] (0/4) Epoch 8, batch 2150, loss[loss=0.2288, simple_loss=0.2889, pruned_loss=0.08434, over 21242.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.2972, pruned_loss=0.07132, over 4271417.32 frames. ], batch size: 143, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:20:06,925 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1293792.0, ans=0.125 2023-06-25 08:20:23,096 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.526e+02 3.335e+02 3.972e+02 5.687e+02 1.021e+03, threshold=7.943e+02, percent-clipped=0.0 2023-06-25 08:20:25,705 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1293852.0, ans=0.125 2023-06-25 08:21:14,698 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1293912.0, ans=0.1 2023-06-25 08:21:19,277 INFO [train.py:996] (0/4) Epoch 8, batch 2200, loss[loss=0.1751, simple_loss=0.2499, pruned_loss=0.05021, over 21152.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.2994, pruned_loss=0.07177, over 4279978.48 frames. ], batch size: 143, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:21:19,844 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1293972.0, ans=0.125 2023-06-25 08:21:37,735 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.80 vs. limit=15.0 2023-06-25 08:22:17,440 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1294152.0, ans=0.2 2023-06-25 08:22:23,016 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1294152.0, ans=0.1 2023-06-25 08:23:08,631 INFO [train.py:996] (0/4) Epoch 8, batch 2250, loss[loss=0.2502, simple_loss=0.3613, pruned_loss=0.06952, over 21214.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2971, pruned_loss=0.07062, over 4284766.20 frames. 
], batch size: 549, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:23:09,288 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1294272.0, ans=0.125 2023-06-25 08:23:38,725 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1294332.0, ans=0.125 2023-06-25 08:24:02,842 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.466e+02 3.638e+02 4.452e+02 6.050e+02 1.629e+03, threshold=8.904e+02, percent-clipped=11.0 2023-06-25 08:24:52,812 INFO [train.py:996] (0/4) Epoch 8, batch 2300, loss[loss=0.2076, simple_loss=0.265, pruned_loss=0.0751, over 21116.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2928, pruned_loss=0.07029, over 4275566.05 frames. ], batch size: 159, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:25:09,141 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1294632.0, ans=0.05 2023-06-25 08:25:19,370 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1294632.0, ans=0.0 2023-06-25 08:25:30,111 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1294692.0, ans=0.035 2023-06-25 08:26:46,485 INFO [train.py:996] (0/4) Epoch 8, batch 2350, loss[loss=0.3024, simple_loss=0.3515, pruned_loss=0.1267, over 21404.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2919, pruned_loss=0.07077, over 4275041.19 frames. ], batch size: 471, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:27:41,217 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.567e+02 4.172e+02 5.399e+02 7.196e+02 1.286e+03, threshold=1.080e+03, percent-clipped=11.0 2023-06-25 08:28:37,760 INFO [train.py:996] (0/4) Epoch 8, batch 2400, loss[loss=0.2648, simple_loss=0.3355, pruned_loss=0.09709, over 21608.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2954, pruned_loss=0.07196, over 4274107.36 frames. ], batch size: 415, lr: 3.84e-03, grad_scale: 32.0 2023-06-25 08:28:57,754 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1295232.0, ans=0.0 2023-06-25 08:29:15,952 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1295232.0, ans=0.2 2023-06-25 08:29:25,916 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.37 vs. limit=15.0 2023-06-25 08:30:05,546 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1295352.0, ans=0.125 2023-06-25 08:30:19,292 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1295412.0, ans=0.1 2023-06-25 08:30:27,361 INFO [train.py:996] (0/4) Epoch 8, batch 2450, loss[loss=0.2175, simple_loss=0.2902, pruned_loss=0.07244, over 21611.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.2991, pruned_loss=0.0732, over 4277395.05 frames. 
], batch size: 212, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:30:55,106 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1295532.0, ans=0.125 2023-06-25 08:31:05,183 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.81 vs. limit=10.0 2023-06-25 08:31:22,460 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.13 vs. limit=12.0 2023-06-25 08:31:24,812 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.579e+02 3.841e+02 6.208e+02 9.164e+02 1.809e+03, threshold=1.242e+03, percent-clipped=16.0 2023-06-25 08:31:29,950 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.13 vs. limit=15.0 2023-06-25 08:31:40,006 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1295652.0, ans=0.1 2023-06-25 08:31:52,751 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1295652.0, ans=0.5 2023-06-25 08:32:12,775 INFO [train.py:996] (0/4) Epoch 8, batch 2500, loss[loss=0.2048, simple_loss=0.2806, pruned_loss=0.06446, over 22017.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2971, pruned_loss=0.07234, over 4278272.93 frames. ], batch size: 103, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:32:13,902 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.27 vs. limit=15.0 2023-06-25 08:32:17,725 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.67 vs. limit=15.0 2023-06-25 08:32:40,526 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=16.38 vs. limit=22.5 2023-06-25 08:32:41,713 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1295832.0, ans=0.125 2023-06-25 08:32:52,785 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1295892.0, ans=0.0 2023-06-25 08:33:34,970 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-216000.pt 2023-06-25 08:33:54,745 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1296012.0, ans=0.0 2023-06-25 08:33:59,248 INFO [train.py:996] (0/4) Epoch 8, batch 2550, loss[loss=0.2137, simple_loss=0.2803, pruned_loss=0.07351, over 21450.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2954, pruned_loss=0.07191, over 4271205.01 frames. ], batch size: 389, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:34:01,876 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1296072.0, ans=0.0 2023-06-25 08:34:02,016 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1296072.0, ans=0.2 2023-06-25 08:34:50,671 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.50 vs. 
limit=15.0 2023-06-25 08:34:56,188 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.632e+02 3.347e+02 3.968e+02 6.148e+02 1.129e+03, threshold=7.936e+02, percent-clipped=0.0 2023-06-25 08:35:24,230 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 08:35:49,675 INFO [train.py:996] (0/4) Epoch 8, batch 2600, loss[loss=0.2247, simple_loss=0.316, pruned_loss=0.06671, over 21449.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2961, pruned_loss=0.07378, over 4275146.44 frames. ], batch size: 211, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:36:12,909 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1296432.0, ans=0.0 2023-06-25 08:36:37,178 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.62 vs. limit=22.5 2023-06-25 08:37:35,798 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1296612.0, ans=0.0 2023-06-25 08:37:40,600 INFO [train.py:996] (0/4) Epoch 8, batch 2650, loss[loss=0.2254, simple_loss=0.3143, pruned_loss=0.06826, over 21689.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.2965, pruned_loss=0.07348, over 4274242.32 frames. ], batch size: 389, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:38:37,358 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.620e+02 3.828e+02 4.857e+02 7.020e+02 1.360e+03, threshold=9.714e+02, percent-clipped=21.0 2023-06-25 08:38:41,654 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1296852.0, ans=0.1 2023-06-25 08:38:47,129 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.44 vs. limit=15.0 2023-06-25 08:38:57,745 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1296852.0, ans=0.07 2023-06-25 08:39:20,166 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1296912.0, ans=0.125 2023-06-25 08:39:24,887 INFO [train.py:996] (0/4) Epoch 8, batch 2700, loss[loss=0.2079, simple_loss=0.2791, pruned_loss=0.06831, over 21676.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2946, pruned_loss=0.07255, over 4280691.55 frames. 
], batch size: 298, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:39:39,494 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1296972.0, ans=0.125 2023-06-25 08:39:42,991 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1297032.0, ans=0.125 2023-06-25 08:40:20,397 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1297092.0, ans=0.2 2023-06-25 08:40:45,669 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1297152.0, ans=0.125 2023-06-25 08:41:02,232 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1297212.0, ans=0.125 2023-06-25 08:41:03,897 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1297212.0, ans=0.125 2023-06-25 08:41:17,926 INFO [train.py:996] (0/4) Epoch 8, batch 2750, loss[loss=0.2388, simple_loss=0.3136, pruned_loss=0.08204, over 21341.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.295, pruned_loss=0.07326, over 4277081.32 frames. ], batch size: 143, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:41:39,380 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.47 vs. limit=15.0 2023-06-25 08:42:21,206 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1297392.0, ans=0.0 2023-06-25 08:42:27,793 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.672e+02 4.055e+02 5.362e+02 7.595e+02 1.481e+03, threshold=1.072e+03, percent-clipped=12.0 2023-06-25 08:42:30,090 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1297452.0, ans=0.015 2023-06-25 08:42:57,383 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1297512.0, ans=0.125 2023-06-25 08:42:57,580 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1297512.0, ans=0.125 2023-06-25 08:43:11,596 INFO [train.py:996] (0/4) Epoch 8, batch 2800, loss[loss=0.2129, simple_loss=0.3336, pruned_loss=0.04607, over 19764.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.2984, pruned_loss=0.0736, over 4278908.61 frames. ], batch size: 702, lr: 3.83e-03, grad_scale: 32.0 2023-06-25 08:43:18,287 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.45 vs. limit=22.5 2023-06-25 08:43:22,789 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1297572.0, ans=0.2 2023-06-25 08:43:25,719 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.98 vs. 
limit=15.0 2023-06-25 08:44:04,407 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1297692.0, ans=0.125 2023-06-25 08:44:26,864 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1297752.0, ans=0.1 2023-06-25 08:44:42,928 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1297812.0, ans=0.125 2023-06-25 08:44:53,469 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1297812.0, ans=0.2 2023-06-25 08:44:59,924 INFO [train.py:996] (0/4) Epoch 8, batch 2850, loss[loss=0.2576, simple_loss=0.32, pruned_loss=0.09764, over 21755.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.3008, pruned_loss=0.0748, over 4283813.75 frames. ], batch size: 441, lr: 3.83e-03, grad_scale: 32.0 2023-06-25 08:45:15,376 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1297872.0, ans=0.025 2023-06-25 08:45:22,116 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 08:45:27,042 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1297932.0, ans=0.125 2023-06-25 08:46:10,497 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1297992.0, ans=0.1 2023-06-25 08:46:13,118 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.766e+02 3.662e+02 5.066e+02 7.139e+02 1.545e+03, threshold=1.013e+03, percent-clipped=5.0 2023-06-25 08:46:36,786 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 08:46:50,054 INFO [train.py:996] (0/4) Epoch 8, batch 2900, loss[loss=0.1578, simple_loss=0.2206, pruned_loss=0.04749, over 21194.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.2967, pruned_loss=0.0737, over 4281768.99 frames. ], batch size: 176, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 08:47:27,517 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1298232.0, ans=0.2 2023-06-25 08:48:42,069 INFO [train.py:996] (0/4) Epoch 8, batch 2950, loss[loss=0.2492, simple_loss=0.348, pruned_loss=0.07518, over 20847.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.2992, pruned_loss=0.07351, over 4283041.06 frames. ], batch size: 607, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 08:48:43,319 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.27 vs. limit=22.5 2023-06-25 08:48:46,294 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1298472.0, ans=0.0 2023-06-25 08:49:14,949 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 08:49:19,452 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.29 vs. 
limit=6.0 2023-06-25 08:49:36,551 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1298592.0, ans=0.125 2023-06-25 08:49:52,433 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1298592.0, ans=0.125 2023-06-25 08:49:56,103 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1298652.0, ans=0.125 2023-06-25 08:49:57,029 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.695e+02 3.497e+02 4.851e+02 7.009e+02 1.350e+03, threshold=9.702e+02, percent-clipped=11.0 2023-06-25 08:50:33,065 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.65 vs. limit=22.5 2023-06-25 08:50:33,463 INFO [train.py:996] (0/4) Epoch 8, batch 3000, loss[loss=0.226, simple_loss=0.2885, pruned_loss=0.08171, over 21419.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3045, pruned_loss=0.0747, over 4283898.26 frames. ], batch size: 211, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 08:50:33,465 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-25 08:50:54,965 INFO [train.py:1028] (0/4) Epoch 8, validation: loss=0.2557, simple_loss=0.3462, pruned_loss=0.08265, over 1796401.00 frames. 2023-06-25 08:50:54,967 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23714MB 2023-06-25 08:51:54,482 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1298892.0, ans=0.125 2023-06-25 08:52:45,467 INFO [train.py:996] (0/4) Epoch 8, batch 3050, loss[loss=0.1711, simple_loss=0.2533, pruned_loss=0.04444, over 21410.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3022, pruned_loss=0.0733, over 4283048.96 frames. ], batch size: 194, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 08:52:51,251 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1299072.0, ans=0.0 2023-06-25 08:53:45,786 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1299192.0, ans=0.0 2023-06-25 08:53:51,094 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1299192.0, ans=0.04949747468305833 2023-06-25 08:53:55,757 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.393e+02 3.327e+02 3.997e+02 5.438e+02 1.383e+03, threshold=7.994e+02, percent-clipped=4.0 2023-06-25 08:54:37,061 INFO [train.py:996] (0/4) Epoch 8, batch 3100, loss[loss=0.2008, simple_loss=0.2931, pruned_loss=0.05426, over 21826.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.3004, pruned_loss=0.07116, over 4284540.60 frames. ], batch size: 282, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 08:54:57,397 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1299372.0, ans=0.0 2023-06-25 08:55:03,190 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.03 vs. 
limit=15.0 2023-06-25 08:55:16,256 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1299432.0, ans=0.125 2023-06-25 08:55:32,907 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.03 vs. limit=22.5 2023-06-25 08:55:35,584 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1299492.0, ans=0.125 2023-06-25 08:56:01,302 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1299552.0, ans=0.125 2023-06-25 08:56:06,495 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1299552.0, ans=0.2 2023-06-25 08:56:39,308 INFO [train.py:996] (0/4) Epoch 8, batch 3150, loss[loss=0.3298, simple_loss=0.3796, pruned_loss=0.14, over 21386.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.3041, pruned_loss=0.07314, over 4284500.89 frames. ], batch size: 509, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 08:57:44,758 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.490e+02 3.426e+02 4.350e+02 5.969e+02 1.538e+03, threshold=8.700e+02, percent-clipped=12.0 2023-06-25 08:58:11,709 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1299912.0, ans=0.125 2023-06-25 08:58:36,639 INFO [train.py:996] (0/4) Epoch 8, batch 3200, loss[loss=0.2463, simple_loss=0.3251, pruned_loss=0.08374, over 21778.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.3048, pruned_loss=0.07332, over 4284003.33 frames. ], batch size: 124, lr: 3.83e-03, grad_scale: 32.0 2023-06-25 08:59:11,954 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1300092.0, ans=0.2 2023-06-25 08:59:28,829 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.32 vs. limit=10.0 2023-06-25 09:00:28,334 INFO [train.py:996] (0/4) Epoch 8, batch 3250, loss[loss=0.2293, simple_loss=0.2966, pruned_loss=0.08102, over 21585.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.3052, pruned_loss=0.07462, over 4284633.71 frames. ], batch size: 230, lr: 3.83e-03, grad_scale: 32.0 2023-06-25 09:00:49,189 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1300332.0, ans=0.0 2023-06-25 09:01:30,043 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.648e+02 3.942e+02 5.285e+02 9.066e+02 2.066e+03, threshold=1.057e+03, percent-clipped=29.0 2023-06-25 09:01:59,914 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1300512.0, ans=0.0 2023-06-25 09:02:20,253 INFO [train.py:996] (0/4) Epoch 8, batch 3300, loss[loss=0.2, simple_loss=0.282, pruned_loss=0.05903, over 21570.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.3009, pruned_loss=0.07416, over 4287307.75 frames. ], batch size: 230, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 09:02:41,276 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.08 vs. 
limit=15.0 2023-06-25 09:03:48,189 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.66 vs. limit=6.0 2023-06-25 09:04:00,198 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1300812.0, ans=0.125 2023-06-25 09:04:11,674 INFO [train.py:996] (0/4) Epoch 8, batch 3350, loss[loss=0.2248, simple_loss=0.3108, pruned_loss=0.06943, over 21483.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.3023, pruned_loss=0.07497, over 4284039.03 frames. ], batch size: 131, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 09:04:53,950 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1300992.0, ans=0.125 2023-06-25 09:04:55,822 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1300992.0, ans=0.0 2023-06-25 09:05:23,413 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.477e+02 4.026e+02 5.637e+02 8.126e+02 1.843e+03, threshold=1.127e+03, percent-clipped=12.0 2023-06-25 09:05:27,916 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1301052.0, ans=0.95 2023-06-25 09:06:01,860 INFO [train.py:996] (0/4) Epoch 8, batch 3400, loss[loss=0.1751, simple_loss=0.263, pruned_loss=0.0436, over 21443.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.3011, pruned_loss=0.07481, over 4288773.63 frames. ], batch size: 211, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 09:06:15,331 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1301172.0, ans=0.125 2023-06-25 09:06:26,204 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1301232.0, ans=0.125 2023-06-25 09:06:33,634 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1301232.0, ans=0.1 2023-06-25 09:06:43,136 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1301292.0, ans=0.0 2023-06-25 09:06:43,709 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.13 vs. limit=15.0 2023-06-25 09:07:55,874 INFO [train.py:996] (0/4) Epoch 8, batch 3450, loss[loss=0.2354, simple_loss=0.2982, pruned_loss=0.0863, over 21576.00 frames. ], tot_loss[loss=0.223, simple_loss=0.2972, pruned_loss=0.07442, over 4275001.20 frames. ], batch size: 548, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 09:09:13,939 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.637e+02 3.588e+02 4.974e+02 7.725e+02 1.763e+03, threshold=9.948e+02, percent-clipped=11.0 2023-06-25 09:09:53,628 INFO [train.py:996] (0/4) Epoch 8, batch 3500, loss[loss=0.2725, simple_loss=0.3449, pruned_loss=0.1, over 21233.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3043, pruned_loss=0.07689, over 4269826.61 frames. ], batch size: 143, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 09:10:10,961 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.96 vs. 
limit=12.0 2023-06-25 09:11:43,840 INFO [train.py:996] (0/4) Epoch 8, batch 3550, loss[loss=0.2155, simple_loss=0.2876, pruned_loss=0.07174, over 21847.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3076, pruned_loss=0.07826, over 4274271.83 frames. ], batch size: 372, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 09:12:55,363 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.560e+02 4.003e+02 5.383e+02 7.230e+02 1.174e+03, threshold=1.077e+03, percent-clipped=7.0 2023-06-25 09:13:21,021 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.27 vs. limit=6.0 2023-06-25 09:13:22,157 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1302312.0, ans=0.125 2023-06-25 09:13:23,925 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1302312.0, ans=0.125 2023-06-25 09:13:32,255 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1302312.0, ans=0.1 2023-06-25 09:13:35,414 INFO [train.py:996] (0/4) Epoch 8, batch 3600, loss[loss=0.2183, simple_loss=0.3184, pruned_loss=0.05915, over 20799.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3028, pruned_loss=0.07782, over 4271968.50 frames. ], batch size: 607, lr: 3.83e-03, grad_scale: 32.0 2023-06-25 09:14:15,436 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1302492.0, ans=0.125 2023-06-25 09:14:52,728 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1302552.0, ans=0.125 2023-06-25 09:15:04,913 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1302612.0, ans=0.125 2023-06-25 09:15:18,563 INFO [train.py:996] (0/4) Epoch 8, batch 3650, loss[loss=0.2005, simple_loss=0.28, pruned_loss=0.06055, over 21434.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3026, pruned_loss=0.07712, over 4273475.96 frames. ], batch size: 131, lr: 3.83e-03, grad_scale: 32.0 2023-06-25 09:15:29,906 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1302672.0, ans=0.0 2023-06-25 09:16:31,581 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.652e+02 4.088e+02 5.545e+02 7.819e+02 1.547e+03, threshold=1.109e+03, percent-clipped=4.0 2023-06-25 09:16:58,290 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1302912.0, ans=0.2 2023-06-25 09:16:59,676 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1302912.0, ans=0.0 2023-06-25 09:17:09,598 INFO [train.py:996] (0/4) Epoch 8, batch 3700, loss[loss=0.225, simple_loss=0.302, pruned_loss=0.07402, over 22034.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.301, pruned_loss=0.07664, over 4279918.45 frames. 
], batch size: 119, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 09:17:44,735 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1303032.0, ans=0.125 2023-06-25 09:18:46,201 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1303212.0, ans=0.2 2023-06-25 09:18:47,755 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1303212.0, ans=0.95 2023-06-25 09:19:01,428 INFO [train.py:996] (0/4) Epoch 8, batch 3750, loss[loss=0.2469, simple_loss=0.3129, pruned_loss=0.09047, over 21750.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.2993, pruned_loss=0.07596, over 4289066.97 frames. ], batch size: 441, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 09:19:41,744 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1303332.0, ans=0.125 2023-06-25 09:20:21,196 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.521e+02 3.272e+02 4.501e+02 6.560e+02 9.292e+02, threshold=9.001e+02, percent-clipped=0.0 2023-06-25 09:20:58,397 INFO [train.py:996] (0/4) Epoch 8, batch 3800, loss[loss=0.2603, simple_loss=0.3266, pruned_loss=0.09701, over 21536.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.2985, pruned_loss=0.07493, over 4288682.73 frames. ], batch size: 473, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 09:21:11,847 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1303572.0, ans=0.0 2023-06-25 09:21:31,169 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1303632.0, ans=0.125 2023-06-25 09:22:12,738 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1303752.0, ans=0.0 2023-06-25 09:22:37,911 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1303812.0, ans=0.125 2023-06-25 09:22:40,698 INFO [train.py:996] (0/4) Epoch 8, batch 3850, loss[loss=0.2289, simple_loss=0.3024, pruned_loss=0.07774, over 21403.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.2963, pruned_loss=0.07516, over 4294229.30 frames. ], batch size: 549, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 09:22:53,708 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1303872.0, ans=0.0 2023-06-25 09:23:15,885 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1303932.0, ans=0.0 2023-06-25 09:23:49,877 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1304052.0, ans=0.1 2023-06-25 09:23:59,530 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.585e+02 3.373e+02 4.487e+02 6.167e+02 2.000e+03, threshold=8.974e+02, percent-clipped=6.0 2023-06-25 09:24:02,029 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1304052.0, ans=0.125 2023-06-25 09:24:20,091 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.95 vs. 
limit=10.0 2023-06-25 09:24:31,278 INFO [train.py:996] (0/4) Epoch 8, batch 3900, loss[loss=0.2477, simple_loss=0.3436, pruned_loss=0.0759, over 19740.00 frames. ], tot_loss[loss=0.221, simple_loss=0.2928, pruned_loss=0.07454, over 4288107.69 frames. ], batch size: 702, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 09:24:49,083 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.72 vs. limit=15.0 2023-06-25 09:25:04,528 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1304232.0, ans=0.1 2023-06-25 09:25:26,108 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1304292.0, ans=0.5 2023-06-25 09:25:27,785 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1304292.0, ans=0.04949747468305833 2023-06-25 09:26:27,179 INFO [train.py:996] (0/4) Epoch 8, batch 3950, loss[loss=0.2473, simple_loss=0.341, pruned_loss=0.07676, over 21623.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2956, pruned_loss=0.07382, over 4286615.46 frames. ], batch size: 441, lr: 3.82e-03, grad_scale: 16.0 2023-06-25 09:26:51,271 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1304532.0, ans=0.125 2023-06-25 09:27:02,322 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1304532.0, ans=0.0 2023-06-25 09:27:38,826 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.574e+02 3.686e+02 5.186e+02 7.402e+02 1.424e+03, threshold=1.037e+03, percent-clipped=9.0 2023-06-25 09:28:16,207 INFO [train.py:996] (0/4) Epoch 8, batch 4000, loss[loss=0.1994, simple_loss=0.2704, pruned_loss=0.06418, over 21827.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2903, pruned_loss=0.07081, over 4275407.51 frames. ], batch size: 98, lr: 3.82e-03, grad_scale: 32.0 2023-06-25 09:28:29,702 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1304772.0, ans=0.125 2023-06-25 09:29:52,568 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1305012.0, ans=0.0 2023-06-25 09:30:11,501 INFO [train.py:996] (0/4) Epoch 8, batch 4050, loss[loss=0.1972, simple_loss=0.2941, pruned_loss=0.05011, over 21755.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2903, pruned_loss=0.06966, over 4279652.81 frames. 
], batch size: 332, lr: 3.82e-03, grad_scale: 32.0 2023-06-25 09:30:36,047 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1305132.0, ans=0.05 2023-06-25 09:31:05,764 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1305192.0, ans=0.0 2023-06-25 09:31:16,110 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1305252.0, ans=0.0 2023-06-25 09:31:18,962 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.508e+02 3.803e+02 4.888e+02 6.657e+02 1.371e+03, threshold=9.776e+02, percent-clipped=4.0 2023-06-25 09:31:26,484 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1305252.0, ans=0.125 2023-06-25 09:31:59,989 INFO [train.py:996] (0/4) Epoch 8, batch 4100, loss[loss=0.219, simple_loss=0.2991, pruned_loss=0.06947, over 21392.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2899, pruned_loss=0.0695, over 4280878.85 frames. ], batch size: 131, lr: 3.82e-03, grad_scale: 16.0 2023-06-25 09:32:27,306 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1305432.0, ans=0.125 2023-06-25 09:32:45,212 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1305492.0, ans=0.125 2023-06-25 09:33:16,332 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1305552.0, ans=0.125 2023-06-25 09:33:27,706 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1305612.0, ans=0.125 2023-06-25 09:33:30,009 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.52 vs. limit=15.0 2023-06-25 09:33:45,938 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1305612.0, ans=0.0 2023-06-25 09:33:48,828 INFO [train.py:996] (0/4) Epoch 8, batch 4150, loss[loss=0.2222, simple_loss=0.2934, pruned_loss=0.07549, over 21632.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.2906, pruned_loss=0.06699, over 4275334.18 frames. ], batch size: 263, lr: 3.82e-03, grad_scale: 8.0 2023-06-25 09:34:14,355 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1305732.0, ans=0.1 2023-06-25 09:34:29,039 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1305732.0, ans=0.125 2023-06-25 09:34:52,332 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1305852.0, ans=0.0 2023-06-25 09:35:00,799 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.587e+02 3.172e+02 3.844e+02 5.295e+02 7.953e+02, threshold=7.689e+02, percent-clipped=0.0 2023-06-25 09:35:08,895 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1305852.0, ans=0.125 2023-06-25 09:35:41,082 INFO [train.py:996] (0/4) Epoch 8, batch 4200, loss[loss=0.2583, simple_loss=0.3484, pruned_loss=0.08407, over 21849.00 frames. 
], tot_loss[loss=0.2127, simple_loss=0.2911, pruned_loss=0.06716, over 4259467.74 frames. ], batch size: 372, lr: 3.82e-03, grad_scale: 8.0 2023-06-25 09:36:37,704 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1306092.0, ans=0.2 2023-06-25 09:37:38,052 INFO [train.py:996] (0/4) Epoch 8, batch 4250, loss[loss=0.2591, simple_loss=0.3764, pruned_loss=0.07092, over 21210.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2993, pruned_loss=0.06923, over 4257737.33 frames. ], batch size: 549, lr: 3.82e-03, grad_scale: 8.0 2023-06-25 09:38:05,966 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1306332.0, ans=0.1 2023-06-25 09:38:57,636 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.607e+02 4.053e+02 6.185e+02 8.917e+02 1.733e+03, threshold=1.237e+03, percent-clipped=33.0 2023-06-25 09:39:07,957 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 09:39:16,896 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1306512.0, ans=0.07 2023-06-25 09:39:16,947 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1306512.0, ans=0.125 2023-06-25 09:39:35,354 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1306512.0, ans=0.035 2023-06-25 09:39:38,316 INFO [train.py:996] (0/4) Epoch 8, batch 4300, loss[loss=0.2881, simple_loss=0.3703, pruned_loss=0.103, over 20739.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.3055, pruned_loss=0.07184, over 4252248.88 frames. ], batch size: 607, lr: 3.82e-03, grad_scale: 8.0 2023-06-25 09:41:28,186 INFO [train.py:996] (0/4) Epoch 8, batch 4350, loss[loss=0.1848, simple_loss=0.2513, pruned_loss=0.05919, over 21404.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.3028, pruned_loss=0.07052, over 4248247.35 frames. ], batch size: 131, lr: 3.82e-03, grad_scale: 8.0 2023-06-25 09:41:37,505 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1306872.0, ans=0.035 2023-06-25 09:42:31,286 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.97 vs. limit=15.0 2023-06-25 09:42:44,510 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.350e+02 3.580e+02 4.513e+02 6.539e+02 1.169e+03, threshold=9.025e+02, percent-clipped=0.0 2023-06-25 09:42:53,051 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.60 vs. limit=15.0 2023-06-25 09:43:16,032 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1307112.0, ans=0.2 2023-06-25 09:43:19,228 INFO [train.py:996] (0/4) Epoch 8, batch 4400, loss[loss=0.2243, simple_loss=0.3048, pruned_loss=0.07185, over 20727.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2971, pruned_loss=0.0704, over 4247402.94 frames. 
], batch size: 608, lr: 3.82e-03, grad_scale: 16.0 2023-06-25 09:43:32,487 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1307172.0, ans=0.125 2023-06-25 09:43:42,157 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.08 vs. limit=15.0 2023-06-25 09:44:02,291 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1307232.0, ans=0.1 2023-06-25 09:45:16,020 INFO [train.py:996] (0/4) Epoch 8, batch 4450, loss[loss=0.3076, simple_loss=0.3918, pruned_loss=0.1117, over 21630.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.308, pruned_loss=0.07264, over 4258970.20 frames. ], batch size: 441, lr: 3.82e-03, grad_scale: 16.0 2023-06-25 09:46:10,259 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1307592.0, ans=0.125 2023-06-25 09:46:12,144 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1307592.0, ans=0.125 2023-06-25 09:46:32,171 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.688e+02 3.788e+02 5.957e+02 8.951e+02 1.705e+03, threshold=1.191e+03, percent-clipped=23.0 2023-06-25 09:47:06,068 INFO [train.py:996] (0/4) Epoch 8, batch 4500, loss[loss=0.2073, simple_loss=0.2661, pruned_loss=0.0742, over 20220.00 frames. ], tot_loss[loss=0.228, simple_loss=0.308, pruned_loss=0.07402, over 4266121.33 frames. ], batch size: 702, lr: 3.82e-03, grad_scale: 16.0 2023-06-25 09:47:32,871 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1307832.0, ans=0.125 2023-06-25 09:47:34,610 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1307832.0, ans=0.0 2023-06-25 09:47:50,792 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1307892.0, ans=0.1 2023-06-25 09:48:17,277 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1307952.0, ans=0.1 2023-06-25 09:48:19,071 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1307952.0, ans=0.125 2023-06-25 09:48:56,031 INFO [train.py:996] (0/4) Epoch 8, batch 4550, loss[loss=0.2429, simple_loss=0.3231, pruned_loss=0.08138, over 21757.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3101, pruned_loss=0.07432, over 4269752.69 frames. ], batch size: 332, lr: 3.82e-03, grad_scale: 16.0 2023-06-25 09:50:03,168 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1308192.0, ans=0.125 2023-06-25 09:50:18,051 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.595e+02 3.343e+02 4.134e+02 5.307e+02 1.038e+03, threshold=8.269e+02, percent-clipped=0.0 2023-06-25 09:50:25,727 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1308252.0, ans=0.125 2023-06-25 09:50:52,058 INFO [train.py:996] (0/4) Epoch 8, batch 4600, loss[loss=0.2027, simple_loss=0.2782, pruned_loss=0.0636, over 21796.00 frames. 
], tot_loss[loss=0.2331, simple_loss=0.313, pruned_loss=0.07658, over 4272372.29 frames. ], batch size: 247, lr: 3.82e-03, grad_scale: 16.0 2023-06-25 09:50:58,025 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1308372.0, ans=0.125 2023-06-25 09:51:26,660 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1308432.0, ans=0.0 2023-06-25 09:51:47,756 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1308492.0, ans=0.1 2023-06-25 09:52:15,714 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1308552.0, ans=0.05 2023-06-25 09:52:22,876 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1308612.0, ans=0.125 2023-06-25 09:52:42,570 INFO [train.py:996] (0/4) Epoch 8, batch 4650, loss[loss=0.1665, simple_loss=0.2355, pruned_loss=0.04874, over 21244.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3068, pruned_loss=0.07434, over 4277315.21 frames. ], batch size: 176, lr: 3.82e-03, grad_scale: 8.0 2023-06-25 09:53:02,451 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1308672.0, ans=10.0 2023-06-25 09:53:28,077 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1308792.0, ans=0.015 2023-06-25 09:53:59,077 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.293e+02 3.213e+02 3.806e+02 5.357e+02 1.908e+03, threshold=7.612e+02, percent-clipped=10.0 2023-06-25 09:54:17,863 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1308912.0, ans=0.1 2023-06-25 09:54:18,268 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=1308912.0, ans=15.0 2023-06-25 09:54:30,161 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1308972.0, ans=0.0 2023-06-25 09:54:31,166 INFO [train.py:996] (0/4) Epoch 8, batch 4700, loss[loss=0.1943, simple_loss=0.2636, pruned_loss=0.06255, over 21824.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2963, pruned_loss=0.07183, over 4265229.88 frames. ], batch size: 107, lr: 3.82e-03, grad_scale: 8.0 2023-06-25 09:54:47,321 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1308972.0, ans=0.1 2023-06-25 09:55:49,684 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1309152.0, ans=0.125 2023-06-25 09:56:14,635 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1309212.0, ans=0.125 2023-06-25 09:56:21,243 INFO [train.py:996] (0/4) Epoch 8, batch 4750, loss[loss=0.2185, simple_loss=0.2941, pruned_loss=0.07148, over 21356.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2901, pruned_loss=0.07111, over 4261003.80 frames. 
], batch size: 131, lr: 3.82e-03, grad_scale: 8.0 2023-06-25 09:56:41,117 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1309272.0, ans=0.1 2023-06-25 09:57:03,171 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1309332.0, ans=0.2 2023-06-25 09:57:08,255 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1309392.0, ans=0.1 2023-06-25 09:57:39,341 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.746e+02 3.551e+02 4.538e+02 6.106e+02 1.235e+03, threshold=9.075e+02, percent-clipped=15.0 2023-06-25 09:58:17,095 INFO [train.py:996] (0/4) Epoch 8, batch 4800, loss[loss=0.2263, simple_loss=0.3305, pruned_loss=0.06102, over 21686.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2915, pruned_loss=0.07181, over 4273060.42 frames. ], batch size: 414, lr: 3.82e-03, grad_scale: 16.0 2023-06-25 09:58:20,831 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1309572.0, ans=0.0 2023-06-25 09:59:10,204 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.95 vs. limit=6.0 2023-06-25 09:59:23,470 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1309752.0, ans=0.0 2023-06-25 09:59:32,830 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.50 vs. limit=10.0 2023-06-25 09:59:59,469 INFO [train.py:996] (0/4) Epoch 8, batch 4850, loss[loss=0.2249, simple_loss=0.3012, pruned_loss=0.07432, over 21858.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2916, pruned_loss=0.07135, over 4275424.83 frames. ], batch size: 371, lr: 3.82e-03, grad_scale: 16.0 2023-06-25 10:01:16,364 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.775e+02 3.669e+02 4.660e+02 6.748e+02 1.065e+03, threshold=9.320e+02, percent-clipped=5.0 2023-06-25 10:01:45,486 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1310112.0, ans=0.2 2023-06-25 10:01:48,395 INFO [train.py:996] (0/4) Epoch 8, batch 4900, loss[loss=0.2235, simple_loss=0.3107, pruned_loss=0.06812, over 21400.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2929, pruned_loss=0.07207, over 4281654.53 frames. ], batch size: 548, lr: 3.82e-03, grad_scale: 16.0 2023-06-25 10:02:19,314 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.58 vs. limit=15.0 2023-06-25 10:02:29,258 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=14.28 vs. limit=15.0 2023-06-25 10:02:35,688 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1310292.0, ans=0.2 2023-06-25 10:03:18,540 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1310412.0, ans=0.125 2023-06-25 10:03:47,913 INFO [train.py:996] (0/4) Epoch 8, batch 4950, loss[loss=0.2015, simple_loss=0.2969, pruned_loss=0.05306, over 21565.00 frames. 
], tot_loss[loss=0.2198, simple_loss=0.2984, pruned_loss=0.07064, over 4280506.54 frames. ], batch size: 441, lr: 3.82e-03, grad_scale: 16.0 2023-06-25 10:04:09,699 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1310532.0, ans=0.125 2023-06-25 10:05:00,801 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.225e+02 3.071e+02 4.183e+02 5.786e+02 1.763e+03, threshold=8.366e+02, percent-clipped=8.0 2023-06-25 10:05:28,888 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1310712.0, ans=0.125 2023-06-25 10:05:37,415 INFO [train.py:996] (0/4) Epoch 8, batch 5000, loss[loss=0.2375, simple_loss=0.3466, pruned_loss=0.06417, over 20688.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2977, pruned_loss=0.06748, over 4278873.55 frames. ], batch size: 607, lr: 3.82e-03, grad_scale: 16.0 2023-06-25 10:05:53,119 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1310832.0, ans=0.1 2023-06-25 10:06:00,393 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1310832.0, ans=0.2 2023-06-25 10:06:14,780 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.35 vs. limit=15.0 2023-06-25 10:06:34,269 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1310952.0, ans=0.125 2023-06-25 10:06:43,168 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1310952.0, ans=0.0 2023-06-25 10:06:45,020 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1310952.0, ans=0.125 2023-06-25 10:06:52,520 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1310952.0, ans=0.5 2023-06-25 10:06:59,433 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1311012.0, ans=0.1 2023-06-25 10:07:19,136 INFO [train.py:996] (0/4) Epoch 8, batch 5050, loss[loss=0.2379, simple_loss=0.3108, pruned_loss=0.08249, over 21872.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2964, pruned_loss=0.06921, over 4281232.04 frames. ], batch size: 118, lr: 3.82e-03, grad_scale: 16.0 2023-06-25 10:07:34,075 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.17 vs. limit=12.0 2023-06-25 10:08:04,883 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1311192.0, ans=0.0 2023-06-25 10:08:30,094 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.462e+02 3.598e+02 4.329e+02 6.155e+02 1.761e+03, threshold=8.658e+02, percent-clipped=10.0 2023-06-25 10:09:07,169 INFO [train.py:996] (0/4) Epoch 8, batch 5100, loss[loss=0.2048, simple_loss=0.2753, pruned_loss=0.06718, over 21874.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2948, pruned_loss=0.06995, over 4289654.85 frames. 
], batch size: 107, lr: 3.81e-03, grad_scale: 16.0 2023-06-25 10:09:40,639 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1311432.0, ans=0.0 2023-06-25 10:09:46,285 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1311492.0, ans=0.125 2023-06-25 10:09:52,956 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1311492.0, ans=0.0 2023-06-25 10:10:52,987 INFO [train.py:996] (0/4) Epoch 8, batch 5150, loss[loss=0.1909, simple_loss=0.2758, pruned_loss=0.05297, over 19848.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2932, pruned_loss=0.06973, over 4288166.98 frames. ], batch size: 703, lr: 3.81e-03, grad_scale: 16.0 2023-06-25 10:11:30,848 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1311732.0, ans=0.125 2023-06-25 10:12:11,330 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.655e+02 3.617e+02 5.481e+02 7.313e+02 1.650e+03, threshold=1.096e+03, percent-clipped=16.0 2023-06-25 10:12:21,057 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1311852.0, ans=0.125 2023-06-25 10:12:38,524 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=2.661e-03 2023-06-25 10:12:48,535 INFO [train.py:996] (0/4) Epoch 8, batch 5200, loss[loss=0.2086, simple_loss=0.2922, pruned_loss=0.06243, over 21253.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2955, pruned_loss=0.07068, over 4288521.39 frames. ], batch size: 176, lr: 3.81e-03, grad_scale: 32.0 2023-06-25 10:12:58,859 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.26 vs. limit=15.0 2023-06-25 10:13:12,422 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1312032.0, ans=0.125 2023-06-25 10:13:25,813 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.28 vs. limit=15.0 2023-06-25 10:14:06,144 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1312152.0, ans=0.125 2023-06-25 10:14:34,285 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1312212.0, ans=0.0 2023-06-25 10:14:43,362 INFO [train.py:996] (0/4) Epoch 8, batch 5250, loss[loss=0.1952, simple_loss=0.2768, pruned_loss=0.05679, over 21226.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.3001, pruned_loss=0.07004, over 4287707.38 frames. ], batch size: 159, lr: 3.81e-03, grad_scale: 32.0 2023-06-25 10:14:44,034 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1312272.0, ans=0.125 2023-06-25 10:15:53,843 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.659e+02 3.587e+02 4.772e+02 6.547e+02 1.598e+03, threshold=9.543e+02, percent-clipped=4.0 2023-06-25 10:15:56,887 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.38 vs. 
limit=12.0 2023-06-25 10:16:04,936 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1312512.0, ans=0.1 2023-06-25 10:16:22,933 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.71 vs. limit=22.5 2023-06-25 10:16:29,986 INFO [train.py:996] (0/4) Epoch 8, batch 5300, loss[loss=0.2234, simple_loss=0.2935, pruned_loss=0.07669, over 21893.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.2995, pruned_loss=0.07093, over 4291860.19 frames. ], batch size: 371, lr: 3.81e-03, grad_scale: 32.0 2023-06-25 10:16:42,185 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1312572.0, ans=0.125 2023-06-25 10:16:47,973 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.21 vs. limit=15.0 2023-06-25 10:17:15,322 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1312692.0, ans=0.1 2023-06-25 10:17:51,956 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1312812.0, ans=0.125 2023-06-25 10:18:17,014 INFO [train.py:996] (0/4) Epoch 8, batch 5350, loss[loss=0.2245, simple_loss=0.3008, pruned_loss=0.07407, over 21356.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2976, pruned_loss=0.07152, over 4289871.97 frames. ], batch size: 159, lr: 3.81e-03, grad_scale: 32.0 2023-06-25 10:18:54,107 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1312932.0, ans=0.0 2023-06-25 10:19:12,898 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1313052.0, ans=0.125 2023-06-25 10:19:28,374 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.684e+02 3.542e+02 4.424e+02 5.994e+02 1.106e+03, threshold=8.848e+02, percent-clipped=4.0 2023-06-25 10:19:32,517 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1313052.0, ans=0.0 2023-06-25 10:20:05,141 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.50 vs. limit=15.0 2023-06-25 10:20:05,571 INFO [train.py:996] (0/4) Epoch 8, batch 5400, loss[loss=0.2363, simple_loss=0.2986, pruned_loss=0.08705, over 21247.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2959, pruned_loss=0.07216, over 4291480.13 frames. 
], batch size: 143, lr: 3.81e-03, grad_scale: 32.0 2023-06-25 10:20:34,216 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1313232.0, ans=0.125 2023-06-25 10:21:12,339 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1313352.0, ans=0.07 2023-06-25 10:21:13,811 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1313352.0, ans=0.0 2023-06-25 10:21:41,731 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1313412.0, ans=0.1 2023-06-25 10:21:47,093 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1313412.0, ans=0.125 2023-06-25 10:21:55,103 INFO [train.py:996] (0/4) Epoch 8, batch 5450, loss[loss=0.2668, simple_loss=0.3683, pruned_loss=0.08264, over 21711.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2984, pruned_loss=0.07054, over 4293038.25 frames. ], batch size: 441, lr: 3.81e-03, grad_scale: 16.0 2023-06-25 10:21:55,579 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1313472.0, ans=0.0 2023-06-25 10:22:16,618 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1313532.0, ans=0.125 2023-06-25 10:22:21,661 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1313532.0, ans=0.0 2023-06-25 10:23:15,414 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.599e+02 4.381e+02 6.345e+02 1.127e+03 2.400e+03, threshold=1.269e+03, percent-clipped=34.0 2023-06-25 10:23:45,615 INFO [train.py:996] (0/4) Epoch 8, batch 5500, loss[loss=0.1989, simple_loss=0.3051, pruned_loss=0.04638, over 21583.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.3019, pruned_loss=0.06747, over 4281797.47 frames. ], batch size: 441, lr: 3.81e-03, grad_scale: 8.0 2023-06-25 10:25:02,360 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1313952.0, ans=0.125 2023-06-25 10:25:32,633 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1314012.0, ans=0.0 2023-06-25 10:25:35,623 INFO [train.py:996] (0/4) Epoch 8, batch 5550, loss[loss=0.1833, simple_loss=0.2775, pruned_loss=0.04456, over 21684.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.3008, pruned_loss=0.0643, over 4276356.26 frames. ], batch size: 298, lr: 3.81e-03, grad_scale: 8.0 2023-06-25 10:25:45,734 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 10:26:23,759 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1314192.0, ans=0.2 2023-06-25 10:27:03,719 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.281e+02 3.121e+02 4.354e+02 6.729e+02 1.471e+03, threshold=8.708e+02, percent-clipped=1.0 2023-06-25 10:27:20,183 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1314312.0, ans=0.2 2023-06-25 10:27:26,992 INFO [train.py:996] (0/4) Epoch 8, batch 5600, loss[loss=0.1305, simple_loss=0.202, pruned_loss=0.02946, over 21904.00 frames. 
], tot_loss[loss=0.2114, simple_loss=0.2988, pruned_loss=0.06197, over 4275922.68 frames. ], batch size: 98, lr: 3.81e-03, grad_scale: 16.0 2023-06-25 10:27:52,972 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=1314432.0, ans=0.05 2023-06-25 10:29:15,395 INFO [train.py:996] (0/4) Epoch 8, batch 5650, loss[loss=0.236, simple_loss=0.311, pruned_loss=0.08046, over 21817.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.3029, pruned_loss=0.06439, over 4282228.41 frames. ], batch size: 107, lr: 3.81e-03, grad_scale: 16.0 2023-06-25 10:29:15,823 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1314672.0, ans=0.0 2023-06-25 10:29:46,959 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.35 vs. limit=22.5 2023-06-25 10:30:01,206 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1314732.0, ans=0.125 2023-06-25 10:30:10,130 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1314792.0, ans=0.5 2023-06-25 10:30:10,638 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.54 vs. limit=15.0 2023-06-25 10:30:42,463 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.720e+02 4.225e+02 5.470e+02 8.803e+02 1.575e+03, threshold=1.094e+03, percent-clipped=25.0 2023-06-25 10:31:12,021 INFO [train.py:996] (0/4) Epoch 8, batch 5700, loss[loss=0.2175, simple_loss=0.2823, pruned_loss=0.07638, over 21249.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.3025, pruned_loss=0.06626, over 4275061.89 frames. ], batch size: 607, lr: 3.81e-03, grad_scale: 16.0 2023-06-25 10:31:37,819 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1314972.0, ans=0.0 2023-06-25 10:32:19,359 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1315092.0, ans=0.1 2023-06-25 10:32:26,946 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1315152.0, ans=0.2 2023-06-25 10:33:11,825 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=1315212.0, ans=0.05 2023-06-25 10:33:14,789 INFO [train.py:996] (0/4) Epoch 8, batch 5750, loss[loss=0.1779, simple_loss=0.2664, pruned_loss=0.04467, over 21430.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2994, pruned_loss=0.06431, over 4276185.53 frames. 
], batch size: 194, lr: 3.81e-03, grad_scale: 16.0 2023-06-25 10:33:38,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1315332.0, ans=0.125 2023-06-25 10:34:24,903 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=1315452.0, ans=0.05 2023-06-25 10:34:31,287 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.460e+02 3.956e+02 5.585e+02 8.690e+02 2.193e+03, threshold=1.117e+03, percent-clipped=12.0 2023-06-25 10:34:34,185 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.64 vs. limit=10.0 2023-06-25 10:34:56,664 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1315512.0, ans=0.0 2023-06-25 10:35:05,097 INFO [train.py:996] (0/4) Epoch 8, batch 5800, loss[loss=0.2378, simple_loss=0.3344, pruned_loss=0.07056, over 21658.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2964, pruned_loss=0.063, over 4270549.12 frames. ], batch size: 414, lr: 3.81e-03, grad_scale: 16.0 2023-06-25 10:35:12,333 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1315572.0, ans=0.0 2023-06-25 10:35:12,386 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_na.min_abs, batch_count=1315572.0, ans=0.02 2023-06-25 10:35:16,364 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.56 vs. limit=15.0 2023-06-25 10:35:40,850 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 10:35:40,881 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1315632.0, ans=0.125 2023-06-25 10:36:55,284 INFO [train.py:996] (0/4) Epoch 8, batch 5850, loss[loss=0.1715, simple_loss=0.2773, pruned_loss=0.03288, over 21623.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.2945, pruned_loss=0.05971, over 4276764.67 frames. ], batch size: 263, lr: 3.81e-03, grad_scale: 16.0 2023-06-25 10:37:13,178 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1315872.0, ans=0.125 2023-06-25 10:38:09,628 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1316052.0, ans=0.04949747468305833 2023-06-25 10:38:16,549 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 10:38:21,042 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.019e+02 3.016e+02 4.169e+02 5.558e+02 1.178e+03, threshold=8.338e+02, percent-clipped=1.0 2023-06-25 10:38:37,618 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.83 vs. limit=6.0 2023-06-25 10:38:42,387 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1316172.0, ans=0.0 2023-06-25 10:38:43,385 INFO [train.py:996] (0/4) Epoch 8, batch 5900, loss[loss=0.1893, simple_loss=0.2703, pruned_loss=0.05419, over 21681.00 frames. 
], tot_loss[loss=0.1997, simple_loss=0.2887, pruned_loss=0.05535, over 4276708.25 frames. ], batch size: 230, lr: 3.81e-03, grad_scale: 16.0 2023-06-25 10:38:47,627 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1316172.0, ans=0.1 2023-06-25 10:40:36,600 INFO [train.py:996] (0/4) Epoch 8, batch 5950, loss[loss=0.2226, simple_loss=0.289, pruned_loss=0.07814, over 21858.00 frames. ], tot_loss[loss=0.2025, simple_loss=0.2873, pruned_loss=0.05883, over 4285703.60 frames. ], batch size: 107, lr: 3.81e-03, grad_scale: 16.0 2023-06-25 10:40:39,496 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1316472.0, ans=0.2 2023-06-25 10:41:57,152 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.419e+02 3.705e+02 4.644e+02 6.015e+02 1.261e+03, threshold=9.288e+02, percent-clipped=6.0 2023-06-25 10:42:24,689 INFO [train.py:996] (0/4) Epoch 8, batch 6000, loss[loss=0.2143, simple_loss=0.275, pruned_loss=0.07679, over 21812.00 frames. ], tot_loss[loss=0.203, simple_loss=0.2828, pruned_loss=0.06166, over 4285005.83 frames. ], batch size: 112, lr: 3.81e-03, grad_scale: 32.0 2023-06-25 10:42:24,691 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-25 10:42:38,356 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.4.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([2.3570, 3.9713, 3.5481, 2.4677], device='cuda:0') 2023-06-25 10:42:43,103 INFO [train.py:1028] (0/4) Epoch 8, validation: loss=0.2599, simple_loss=0.3542, pruned_loss=0.08283, over 1796401.00 frames. 2023-06-25 10:42:43,104 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23714MB 2023-06-25 10:43:20,876 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1316832.0, ans=0.1 2023-06-25 10:44:30,942 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1317072.0, ans=0.125 2023-06-25 10:44:32,111 INFO [train.py:996] (0/4) Epoch 8, batch 6050, loss[loss=0.2323, simple_loss=0.2799, pruned_loss=0.09232, over 21296.00 frames. ], tot_loss[loss=0.2021, simple_loss=0.2781, pruned_loss=0.06305, over 4284609.41 frames. ], batch size: 473, lr: 3.81e-03, grad_scale: 16.0 2023-06-25 10:45:00,455 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.40 vs. limit=15.0 2023-06-25 10:45:12,918 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.52 vs. limit=15.0 2023-06-25 10:45:22,025 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.43 vs. limit=12.0 2023-06-25 10:45:25,358 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.45 vs. limit=12.0 2023-06-25 10:45:34,045 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.32 vs. 
limit=15.0 2023-06-25 10:45:54,662 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1317252.0, ans=0.125 2023-06-25 10:46:02,682 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.258e+02 3.027e+02 3.543e+02 4.966e+02 9.624e+02, threshold=7.086e+02, percent-clipped=3.0 2023-06-25 10:46:15,122 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1317312.0, ans=0.125 2023-06-25 10:46:21,238 INFO [train.py:996] (0/4) Epoch 8, batch 6100, loss[loss=0.2116, simple_loss=0.3222, pruned_loss=0.05049, over 19794.00 frames. ], tot_loss[loss=0.2009, simple_loss=0.2779, pruned_loss=0.06197, over 4284583.90 frames. ], batch size: 702, lr: 3.81e-03, grad_scale: 8.0 2023-06-25 10:46:23,720 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1317372.0, ans=0.1 2023-06-25 10:46:25,310 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1317372.0, ans=0.2 2023-06-25 10:46:59,191 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1317432.0, ans=0.1 2023-06-25 10:47:23,091 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1317492.0, ans=0.125 2023-06-25 10:47:33,234 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1317552.0, ans=0.0 2023-06-25 10:47:41,568 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1317552.0, ans=0.0 2023-06-25 10:47:56,400 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.78 vs. limit=15.0 2023-06-25 10:48:09,188 INFO [train.py:996] (0/4) Epoch 8, batch 6150, loss[loss=0.2274, simple_loss=0.2927, pruned_loss=0.08104, over 21111.00 frames. ], tot_loss[loss=0.2053, simple_loss=0.2811, pruned_loss=0.0647, over 4287515.36 frames. ], batch size: 159, lr: 3.81e-03, grad_scale: 8.0 2023-06-25 10:48:39,654 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1317732.0, ans=0.04949747468305833 2023-06-25 10:48:58,970 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1317792.0, ans=0.0 2023-06-25 10:49:38,027 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.673e+02 3.233e+02 3.904e+02 5.485e+02 1.131e+03, threshold=7.808e+02, percent-clipped=12.0 2023-06-25 10:49:48,635 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.39 vs. limit=15.0 2023-06-25 10:49:58,257 INFO [train.py:996] (0/4) Epoch 8, batch 6200, loss[loss=0.2387, simple_loss=0.3195, pruned_loss=0.07898, over 21849.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2851, pruned_loss=0.06505, over 4279969.61 frames. 
], batch size: 351, lr: 3.81e-03, grad_scale: 8.0 2023-06-25 10:50:54,391 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1318092.0, ans=0.0 2023-06-25 10:51:31,441 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=5.84 vs. limit=22.5 2023-06-25 10:51:37,828 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1318212.0, ans=0.125 2023-06-25 10:51:49,420 INFO [train.py:996] (0/4) Epoch 8, batch 6250, loss[loss=0.2157, simple_loss=0.321, pruned_loss=0.05523, over 21765.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2912, pruned_loss=0.06523, over 4280213.68 frames. ], batch size: 332, lr: 3.80e-03, grad_scale: 8.0 2023-06-25 10:53:17,968 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.870e+02 4.537e+02 6.426e+02 9.551e+02 1.693e+03, threshold=1.285e+03, percent-clipped=41.0 2023-06-25 10:53:20,333 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1318512.0, ans=0.0 2023-06-25 10:53:27,053 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 10:53:30,920 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1318512.0, ans=0.125 2023-06-25 10:53:42,673 INFO [train.py:996] (0/4) Epoch 8, batch 6300, loss[loss=0.2278, simple_loss=0.3028, pruned_loss=0.0764, over 21894.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.294, pruned_loss=0.0639, over 4277850.39 frames. ], batch size: 118, lr: 3.80e-03, grad_scale: 8.0 2023-06-25 10:54:16,307 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1318632.0, ans=0.125 2023-06-25 10:54:37,106 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1318692.0, ans=0.0 2023-06-25 10:54:39,608 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.81 vs. limit=15.0 2023-06-25 10:54:47,655 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1318692.0, ans=0.2 2023-06-25 10:55:42,043 INFO [train.py:996] (0/4) Epoch 8, batch 6350, loss[loss=0.2696, simple_loss=0.3384, pruned_loss=0.1004, over 21603.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.2958, pruned_loss=0.06769, over 4286493.63 frames. ], batch size: 414, lr: 3.80e-03, grad_scale: 8.0 2023-06-25 10:56:02,043 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1318932.0, ans=0.0 2023-06-25 10:56:12,470 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1318932.0, ans=0.0 2023-06-25 10:57:02,236 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.857e+02 3.830e+02 4.751e+02 5.817e+02 1.226e+03, threshold=9.501e+02, percent-clipped=0.0 2023-06-25 10:57:27,565 INFO [train.py:996] (0/4) Epoch 8, batch 6400, loss[loss=0.291, simple_loss=0.3494, pruned_loss=0.1164, over 21438.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.3016, pruned_loss=0.07169, over 4290298.31 frames. 
], batch size: 471, lr: 3.80e-03, grad_scale: 16.0 2023-06-25 10:57:28,084 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1319172.0, ans=0.035 2023-06-25 10:58:34,709 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1319352.0, ans=0.1 2023-06-25 10:58:44,811 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1319352.0, ans=0.125 2023-06-25 10:59:12,759 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=1319412.0, ans=0.05 2023-06-25 10:59:17,479 INFO [train.py:996] (0/4) Epoch 8, batch 6450, loss[loss=0.1987, simple_loss=0.2982, pruned_loss=0.04958, over 21882.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.3044, pruned_loss=0.07107, over 4280379.87 frames. ], batch size: 317, lr: 3.80e-03, grad_scale: 16.0 2023-06-25 11:00:42,535 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.679e+02 3.948e+02 4.858e+02 6.624e+02 1.248e+03, threshold=9.716e+02, percent-clipped=3.0 2023-06-25 11:01:06,964 INFO [train.py:996] (0/4) Epoch 8, batch 6500, loss[loss=0.2107, simple_loss=0.267, pruned_loss=0.07723, over 21256.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2982, pruned_loss=0.06987, over 4274413.30 frames. ], batch size: 471, lr: 3.80e-03, grad_scale: 16.0 2023-06-25 11:01:44,868 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1319892.0, ans=0.125 2023-06-25 11:01:46,538 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1319892.0, ans=0.125 2023-06-25 11:02:28,171 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-220000.pt 2023-06-25 11:02:41,438 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.47 vs. limit=22.5 2023-06-25 11:02:57,211 INFO [train.py:996] (0/4) Epoch 8, batch 6550, loss[loss=0.1648, simple_loss=0.2903, pruned_loss=0.01969, over 20840.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.297, pruned_loss=0.06878, over 4278662.62 frames. ], batch size: 607, lr: 3.80e-03, grad_scale: 16.0 2023-06-25 11:03:16,857 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1320132.0, ans=0.1 2023-06-25 11:03:17,504 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.73 vs. limit=15.0 2023-06-25 11:04:22,985 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.651e+02 3.584e+02 5.538e+02 7.556e+02 1.701e+03, threshold=1.108e+03, percent-clipped=14.0 2023-06-25 11:04:46,583 INFO [train.py:996] (0/4) Epoch 8, batch 6600, loss[loss=0.1773, simple_loss=0.2405, pruned_loss=0.05707, over 21390.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2921, pruned_loss=0.06895, over 4272079.04 frames. 
], batch size: 211, lr: 3.80e-03, grad_scale: 8.0 2023-06-25 11:06:17,532 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1320612.0, ans=0.05 2023-06-25 11:06:36,632 INFO [train.py:996] (0/4) Epoch 8, batch 6650, loss[loss=0.175, simple_loss=0.2503, pruned_loss=0.04978, over 21555.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2851, pruned_loss=0.06635, over 4279021.55 frames. ], batch size: 230, lr: 3.80e-03, grad_scale: 8.0 2023-06-25 11:06:57,619 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1320732.0, ans=0.1 2023-06-25 11:07:01,209 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1320732.0, ans=0.125 2023-06-25 11:07:03,019 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1320732.0, ans=0.0 2023-06-25 11:08:03,791 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.500e+02 3.315e+02 4.377e+02 5.902e+02 1.210e+03, threshold=8.754e+02, percent-clipped=3.0 2023-06-25 11:08:26,394 INFO [train.py:996] (0/4) Epoch 8, batch 6700, loss[loss=0.1958, simple_loss=0.2626, pruned_loss=0.06455, over 21520.00 frames. ], tot_loss[loss=0.2053, simple_loss=0.2792, pruned_loss=0.06576, over 4279200.62 frames. ], batch size: 230, lr: 3.80e-03, grad_scale: 8.0 2023-06-25 11:10:10,076 INFO [train.py:996] (0/4) Epoch 8, batch 6750, loss[loss=0.2386, simple_loss=0.3433, pruned_loss=0.06694, over 19825.00 frames. ], tot_loss[loss=0.2049, simple_loss=0.2772, pruned_loss=0.06626, over 4281021.51 frames. ], batch size: 703, lr: 3.80e-03, grad_scale: 8.0 2023-06-25 11:10:23,042 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1321272.0, ans=0.0 2023-06-25 11:11:04,972 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1321392.0, ans=0.0 2023-06-25 11:11:12,233 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1321392.0, ans=0.2 2023-06-25 11:11:24,535 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1321452.0, ans=0.2 2023-06-25 11:11:35,909 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.658e+02 3.451e+02 4.455e+02 6.236e+02 1.487e+03, threshold=8.910e+02, percent-clipped=11.0 2023-06-25 11:11:39,252 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.85 vs. limit=15.0 2023-06-25 11:11:58,608 INFO [train.py:996] (0/4) Epoch 8, batch 6800, loss[loss=0.2175, simple_loss=0.2831, pruned_loss=0.07591, over 21822.00 frames. ], tot_loss[loss=0.2085, simple_loss=0.2803, pruned_loss=0.06835, over 4271448.77 frames. ], batch size: 118, lr: 3.80e-03, grad_scale: 16.0 2023-06-25 11:12:14,715 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1321632.0, ans=0.125 2023-06-25 11:12:43,827 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.78 vs. 
limit=15.0 2023-06-25 11:12:43,906 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.48 vs. limit=15.0 2023-06-25 11:12:55,766 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 11:13:28,589 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1321812.0, ans=0.0 2023-06-25 11:13:41,750 INFO [train.py:996] (0/4) Epoch 8, batch 6850, loss[loss=0.21, simple_loss=0.2758, pruned_loss=0.07206, over 21546.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.281, pruned_loss=0.069, over 4263465.58 frames. ], batch size: 389, lr: 3.80e-03, grad_scale: 16.0 2023-06-25 11:14:07,660 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=1321932.0, ans=15.0 2023-06-25 11:14:27,035 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1321992.0, ans=0.04949747468305833 2023-06-25 11:15:09,322 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.681e+02 3.753e+02 5.063e+02 7.364e+02 1.523e+03, threshold=1.013e+03, percent-clipped=16.0 2023-06-25 11:15:15,569 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1322112.0, ans=0.0 2023-06-25 11:15:32,229 INFO [train.py:996] (0/4) Epoch 8, batch 6900, loss[loss=0.2, simple_loss=0.282, pruned_loss=0.05904, over 21898.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2829, pruned_loss=0.06983, over 4277321.56 frames. ], batch size: 316, lr: 3.80e-03, grad_scale: 16.0 2023-06-25 11:15:34,697 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1322172.0, ans=0.1 2023-06-25 11:15:58,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1322232.0, ans=0.2 2023-06-25 11:16:00,087 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1322232.0, ans=0.1 2023-06-25 11:16:09,414 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1322232.0, ans=0.1 2023-06-25 11:16:43,283 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1322352.0, ans=10.0 2023-06-25 11:17:20,692 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1322412.0, ans=0.2 2023-06-25 11:17:23,577 INFO [train.py:996] (0/4) Epoch 8, batch 6950, loss[loss=0.3069, simple_loss=0.3553, pruned_loss=0.1293, over 21337.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.286, pruned_loss=0.06754, over 4278265.58 frames. 
], batch size: 507, lr: 3.80e-03, grad_scale: 16.0 2023-06-25 11:17:26,171 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1322472.0, ans=0.2 2023-06-25 11:17:57,613 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1322532.0, ans=0.5 2023-06-25 11:18:54,438 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.354e+02 3.544e+02 4.966e+02 6.681e+02 1.694e+03, threshold=9.931e+02, percent-clipped=7.0 2023-06-25 11:19:12,244 INFO [train.py:996] (0/4) Epoch 8, batch 7000, loss[loss=0.2176, simple_loss=0.2773, pruned_loss=0.07893, over 21125.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2895, pruned_loss=0.06945, over 4271105.95 frames. ], batch size: 143, lr: 3.80e-03, grad_scale: 16.0 2023-06-25 11:20:56,850 INFO [train.py:996] (0/4) Epoch 8, batch 7050, loss[loss=0.1605, simple_loss=0.2333, pruned_loss=0.04388, over 21492.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.2857, pruned_loss=0.06897, over 4268691.24 frames. ], batch size: 131, lr: 3.80e-03, grad_scale: 16.0 2023-06-25 11:21:23,102 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1323072.0, ans=0.125 2023-06-25 11:21:59,240 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1323192.0, ans=0.125 2023-06-25 11:22:22,593 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1323252.0, ans=0.0 2023-06-25 11:22:30,815 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.752e+02 3.663e+02 4.659e+02 6.225e+02 9.950e+02, threshold=9.319e+02, percent-clipped=1.0 2023-06-25 11:22:41,527 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.31 vs. limit=22.5 2023-06-25 11:22:48,493 INFO [train.py:996] (0/4) Epoch 8, batch 7100, loss[loss=0.1837, simple_loss=0.266, pruned_loss=0.05073, over 21681.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2889, pruned_loss=0.06877, over 4258915.44 frames. ], batch size: 298, lr: 3.80e-03, grad_scale: 16.0 2023-06-25 11:23:46,126 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1323492.0, ans=0.2 2023-06-25 11:23:58,408 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 11:24:42,543 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.93 vs. limit=10.0 2023-06-25 11:24:44,533 INFO [train.py:996] (0/4) Epoch 8, batch 7150, loss[loss=0.1876, simple_loss=0.2487, pruned_loss=0.06322, over 20700.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2875, pruned_loss=0.06688, over 4260967.42 frames. 
], batch size: 607, lr: 3.80e-03, grad_scale: 16.0 2023-06-25 11:25:16,981 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 11:25:25,651 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1323792.0, ans=0.1 2023-06-25 11:25:28,962 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1323792.0, ans=0.1 2023-06-25 11:25:36,589 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1323792.0, ans=0.125 2023-06-25 11:26:11,774 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.272e+02 3.580e+02 4.514e+02 6.175e+02 1.199e+03, threshold=9.027e+02, percent-clipped=4.0 2023-06-25 11:26:25,163 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1323912.0, ans=0.125 2023-06-25 11:26:28,881 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1323912.0, ans=0.2 2023-06-25 11:26:40,820 INFO [train.py:996] (0/4) Epoch 8, batch 7200, loss[loss=0.2336, simple_loss=0.3428, pruned_loss=0.06218, over 19770.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2899, pruned_loss=0.06902, over 4260090.95 frames. ], batch size: 703, lr: 3.80e-03, grad_scale: 32.0 2023-06-25 11:28:28,916 INFO [train.py:996] (0/4) Epoch 8, batch 7250, loss[loss=0.1805, simple_loss=0.2411, pruned_loss=0.05991, over 21432.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2862, pruned_loss=0.06907, over 4267848.53 frames. ], batch size: 212, lr: 3.80e-03, grad_scale: 16.0 2023-06-25 11:28:36,139 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1324272.0, ans=0.125 2023-06-25 11:28:44,522 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1324332.0, ans=0.1 2023-06-25 11:28:58,137 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1324332.0, ans=0.125 2023-06-25 11:29:26,091 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1324452.0, ans=0.125 2023-06-25 11:29:34,728 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1324452.0, ans=0.0 2023-06-25 11:29:51,148 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.614e+02 3.605e+02 4.552e+02 6.343e+02 1.382e+03, threshold=9.103e+02, percent-clipped=6.0 2023-06-25 11:30:17,030 INFO [train.py:996] (0/4) Epoch 8, batch 7300, loss[loss=0.2004, simple_loss=0.2658, pruned_loss=0.06748, over 21976.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2798, pruned_loss=0.06765, over 4272399.23 frames. 
], batch size: 103, lr: 3.80e-03, grad_scale: 16.0 2023-06-25 11:30:19,630 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1324572.0, ans=0.0 2023-06-25 11:30:37,667 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1324632.0, ans=0.125 2023-06-25 11:30:46,324 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1324632.0, ans=0.125 2023-06-25 11:30:53,458 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1324632.0, ans=0.0 2023-06-25 11:31:11,433 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.28 vs. limit=12.0 2023-06-25 11:31:26,796 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=6.73 vs. limit=22.5 2023-06-25 11:31:26,876 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.32 vs. limit=15.0 2023-06-25 11:31:41,227 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1324812.0, ans=0.0 2023-06-25 11:32:07,792 INFO [train.py:996] (0/4) Epoch 8, batch 7350, loss[loss=0.2384, simple_loss=0.3148, pruned_loss=0.08098, over 21485.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.2782, pruned_loss=0.06776, over 4267351.24 frames. ], batch size: 131, lr: 3.80e-03, grad_scale: 16.0 2023-06-25 11:32:14,256 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=1324872.0, ans=10.0 2023-06-25 11:32:14,303 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1324872.0, ans=0.0 2023-06-25 11:33:44,446 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.590e+02 4.058e+02 5.630e+02 9.164e+02 1.929e+03, threshold=1.126e+03, percent-clipped=26.0 2023-06-25 11:34:01,241 INFO [train.py:996] (0/4) Epoch 8, batch 7400, loss[loss=0.2315, simple_loss=0.3214, pruned_loss=0.07083, over 21704.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2828, pruned_loss=0.06979, over 4268643.29 frames. ], batch size: 415, lr: 3.79e-03, grad_scale: 16.0 2023-06-25 11:34:05,960 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.04 vs. limit=6.0 2023-06-25 11:35:51,161 INFO [train.py:996] (0/4) Epoch 8, batch 7450, loss[loss=0.1859, simple_loss=0.2566, pruned_loss=0.05757, over 21397.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2808, pruned_loss=0.06901, over 4255811.96 frames. 
], batch size: 131, lr: 3.79e-03, grad_scale: 16.0 2023-06-25 11:35:57,118 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1325472.0, ans=0.125 2023-06-25 11:36:15,200 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 11:36:24,215 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1325532.0, ans=0.2 2023-06-25 11:37:24,266 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.26 vs. limit=10.0 2023-06-25 11:37:28,590 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.598e+02 3.413e+02 4.464e+02 6.199e+02 1.662e+03, threshold=8.927e+02, percent-clipped=2.0 2023-06-25 11:37:50,177 INFO [train.py:996] (0/4) Epoch 8, batch 7500, loss[loss=0.2405, simple_loss=0.3461, pruned_loss=0.06744, over 21625.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2862, pruned_loss=0.0702, over 4263363.50 frames. ], batch size: 263, lr: 3.79e-03, grad_scale: 16.0 2023-06-25 11:39:13,527 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 11:39:34,959 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1326012.0, ans=0.2 2023-06-25 11:39:37,785 INFO [train.py:996] (0/4) Epoch 8, batch 7550, loss[loss=0.2102, simple_loss=0.2921, pruned_loss=0.0642, over 21795.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2948, pruned_loss=0.07012, over 4271694.12 frames. ], batch size: 118, lr: 3.79e-03, grad_scale: 16.0 2023-06-25 11:40:09,266 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.40 vs. limit=6.0 2023-06-25 11:40:47,588 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.87 vs. limit=22.5 2023-06-25 11:40:48,971 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1326252.0, ans=0.125 2023-06-25 11:41:05,655 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.417e+02 3.672e+02 5.210e+02 9.088e+02 2.173e+03, threshold=1.042e+03, percent-clipped=24.0 2023-06-25 11:41:12,501 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.68 vs. limit=15.0 2023-06-25 11:41:26,457 INFO [train.py:996] (0/4) Epoch 8, batch 7600, loss[loss=0.2164, simple_loss=0.2946, pruned_loss=0.06912, over 21744.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2939, pruned_loss=0.06934, over 4274801.62 frames. ], batch size: 389, lr: 3.79e-03, grad_scale: 32.0 2023-06-25 11:41:42,244 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1326432.0, ans=0.1 2023-06-25 11:41:52,680 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1326432.0, ans=0.125 2023-06-25 11:42:18,174 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.41 vs. 
limit=15.0 2023-06-25 11:42:19,945 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.44 vs. limit=22.5 2023-06-25 11:42:34,913 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1326552.0, ans=0.1 2023-06-25 11:42:36,723 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1326552.0, ans=0.1 2023-06-25 11:42:43,573 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1326552.0, ans=0.125 2023-06-25 11:42:50,962 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1326612.0, ans=0.125 2023-06-25 11:43:09,690 INFO [train.py:996] (0/4) Epoch 8, batch 7650, loss[loss=0.2233, simple_loss=0.2919, pruned_loss=0.07733, over 21768.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.294, pruned_loss=0.06985, over 4283540.86 frames. ], batch size: 389, lr: 3.79e-03, grad_scale: 32.0 2023-06-25 11:43:15,469 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1326672.0, ans=0.125 2023-06-25 11:44:10,203 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1326792.0, ans=0.0 2023-06-25 11:44:43,903 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1326912.0, ans=0.125 2023-06-25 11:44:44,984 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.753e+02 3.604e+02 4.352e+02 5.552e+02 1.331e+03, threshold=8.705e+02, percent-clipped=4.0 2023-06-25 11:44:58,554 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1326972.0, ans=0.125 2023-06-25 11:44:59,559 INFO [train.py:996] (0/4) Epoch 8, batch 7700, loss[loss=0.2484, simple_loss=0.3142, pruned_loss=0.09125, over 21403.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2958, pruned_loss=0.07233, over 4286353.41 frames. ], batch size: 548, lr: 3.79e-03, grad_scale: 16.0 2023-06-25 11:45:17,979 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1327032.0, ans=0.1 2023-06-25 11:45:33,691 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1327032.0, ans=0.125 2023-06-25 11:46:06,627 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1327152.0, ans=0.0 2023-06-25 11:46:11,901 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1327152.0, ans=0.125 2023-06-25 11:46:33,432 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.18 vs. limit=10.0 2023-06-25 11:46:46,205 INFO [train.py:996] (0/4) Epoch 8, batch 7750, loss[loss=0.2587, simple_loss=0.3485, pruned_loss=0.0845, over 21591.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.2996, pruned_loss=0.07289, over 4280214.84 frames. 
], batch size: 230, lr: 3.79e-03, grad_scale: 8.0 2023-06-25 11:47:35,941 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1327332.0, ans=0.0 2023-06-25 11:48:09,919 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.22 vs. limit=22.5 2023-06-25 11:48:24,231 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.48 vs. limit=12.0 2023-06-25 11:48:24,882 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.807e+02 4.127e+02 5.917e+02 8.235e+02 1.345e+03, threshold=1.183e+03, percent-clipped=19.0 2023-06-25 11:48:37,411 INFO [train.py:996] (0/4) Epoch 8, batch 7800, loss[loss=0.2312, simple_loss=0.3039, pruned_loss=0.07923, over 21856.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3015, pruned_loss=0.07363, over 4277544.57 frames. ], batch size: 373, lr: 3.79e-03, grad_scale: 8.0 2023-06-25 11:49:10,441 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1327632.0, ans=0.09899494936611666 2023-06-25 11:49:36,140 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1327692.0, ans=0.0 2023-06-25 11:50:02,599 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1327752.0, ans=0.0 2023-06-25 11:50:26,499 INFO [train.py:996] (0/4) Epoch 8, batch 7850, loss[loss=0.2189, simple_loss=0.2958, pruned_loss=0.07098, over 20673.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.2961, pruned_loss=0.07318, over 4263464.73 frames. ], batch size: 607, lr: 3.79e-03, grad_scale: 8.0 2023-06-25 11:50:39,139 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1327872.0, ans=0.125 2023-06-25 11:51:54,148 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1328052.0, ans=0.0 2023-06-25 11:52:07,055 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.783e+02 3.555e+02 5.085e+02 7.464e+02 1.705e+03, threshold=1.017e+03, percent-clipped=5.0 2023-06-25 11:52:26,556 INFO [train.py:996] (0/4) Epoch 8, batch 7900, loss[loss=0.1944, simple_loss=0.2669, pruned_loss=0.06096, over 21232.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2925, pruned_loss=0.07288, over 4261399.64 frames. ], batch size: 176, lr: 3.79e-03, grad_scale: 8.0 2023-06-25 11:53:57,201 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.15 vs. limit=10.0 2023-06-25 11:54:24,570 INFO [train.py:996] (0/4) Epoch 8, batch 7950, loss[loss=0.2315, simple_loss=0.316, pruned_loss=0.07348, over 21892.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.296, pruned_loss=0.07222, over 4255776.33 frames. ], batch size: 316, lr: 3.79e-03, grad_scale: 8.0 2023-06-25 11:54:25,850 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.68 vs. limit=15.0 2023-06-25 11:54:43,585 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.95 vs. 
limit=6.0 2023-06-25 11:55:36,589 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1328652.0, ans=0.1 2023-06-25 11:56:11,366 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.729e+02 4.611e+02 6.417e+02 9.938e+02 3.239e+03, threshold=1.283e+03, percent-clipped=22.0 2023-06-25 11:56:12,012 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1328712.0, ans=0.125 2023-06-25 11:56:14,037 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1328712.0, ans=0.125 2023-06-25 11:56:15,830 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1328712.0, ans=0.125 2023-06-25 11:56:21,032 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1328712.0, ans=0.0 2023-06-25 11:56:24,111 INFO [train.py:996] (0/4) Epoch 8, batch 8000, loss[loss=0.256, simple_loss=0.3396, pruned_loss=0.08625, over 21600.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.2991, pruned_loss=0.0738, over 4257994.45 frames. ], batch size: 414, lr: 3.79e-03, grad_scale: 16.0 2023-06-25 11:57:01,970 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1328832.0, ans=0.125 2023-06-25 11:58:24,050 INFO [train.py:996] (0/4) Epoch 8, batch 8050, loss[loss=0.2024, simple_loss=0.3092, pruned_loss=0.0478, over 20787.00 frames. ], tot_loss[loss=0.224, simple_loss=0.2998, pruned_loss=0.07411, over 4248190.59 frames. ], batch size: 609, lr: 3.79e-03, grad_scale: 16.0 2023-06-25 11:58:32,453 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.84 vs. limit=15.0 2023-06-25 11:58:42,917 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1329132.0, ans=0.125 2023-06-25 11:59:09,823 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1329192.0, ans=0.5 2023-06-25 11:59:13,606 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1329192.0, ans=0.1 2023-06-25 11:59:24,854 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1329192.0, ans=0.1 2023-06-25 11:59:51,107 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.05 vs. limit=15.0 2023-06-25 12:00:03,825 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.754e+02 4.648e+02 6.798e+02 1.163e+03 2.924e+03, threshold=1.360e+03, percent-clipped=20.0 2023-06-25 12:00:16,727 INFO [train.py:996] (0/4) Epoch 8, batch 8100, loss[loss=0.2322, simple_loss=0.3095, pruned_loss=0.07743, over 21733.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.3003, pruned_loss=0.074, over 4253620.62 frames. 
], batch size: 389, lr: 3.79e-03, grad_scale: 16.0 2023-06-25 12:01:13,099 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1329492.0, ans=0.2 2023-06-25 12:01:41,248 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1329552.0, ans=0.1 2023-06-25 12:01:48,666 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1329552.0, ans=0.0 2023-06-25 12:01:49,257 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=19.62 vs. limit=22.5 2023-06-25 12:02:15,827 INFO [train.py:996] (0/4) Epoch 8, batch 8150, loss[loss=0.1934, simple_loss=0.2757, pruned_loss=0.05555, over 21494.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3094, pruned_loss=0.07555, over 4261332.87 frames. ], batch size: 195, lr: 3.79e-03, grad_scale: 16.0 2023-06-25 12:02:37,759 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1329732.0, ans=0.0 2023-06-25 12:02:40,031 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.37 vs. limit=15.0 2023-06-25 12:03:00,988 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1329792.0, ans=0.125 2023-06-25 12:03:47,518 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.902e+02 4.317e+02 6.289e+02 1.033e+03 2.172e+03, threshold=1.258e+03, percent-clipped=12.0 2023-06-25 12:04:04,949 INFO [train.py:996] (0/4) Epoch 8, batch 8200, loss[loss=0.2044, simple_loss=0.2685, pruned_loss=0.07012, over 21560.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.302, pruned_loss=0.0739, over 4251334.33 frames. ], batch size: 391, lr: 3.79e-03, grad_scale: 16.0 2023-06-25 12:04:30,593 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1330032.0, ans=0.2 2023-06-25 12:04:43,665 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1330032.0, ans=0.125 2023-06-25 12:05:41,362 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 12:05:54,968 INFO [train.py:996] (0/4) Epoch 8, batch 8250, loss[loss=0.2497, simple_loss=0.3452, pruned_loss=0.0771, over 21644.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.3012, pruned_loss=0.07355, over 4249489.31 frames. ], batch size: 414, lr: 3.79e-03, grad_scale: 16.0 2023-06-25 12:06:31,357 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1330332.0, ans=0.125 2023-06-25 12:07:16,084 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.87 vs. 
limit=15.0 2023-06-25 12:07:27,233 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.445e+02 3.428e+02 4.253e+02 6.741e+02 1.234e+03, threshold=8.505e+02, percent-clipped=0.0 2023-06-25 12:07:34,937 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1330512.0, ans=0.125 2023-06-25 12:07:49,927 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.19 vs. limit=10.0 2023-06-25 12:07:50,423 INFO [train.py:996] (0/4) Epoch 8, batch 8300, loss[loss=0.2416, simple_loss=0.3246, pruned_loss=0.07931, over 21622.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2988, pruned_loss=0.07105, over 4252854.23 frames. ], batch size: 389, lr: 3.79e-03, grad_scale: 16.0 2023-06-25 12:08:31,922 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.95 vs. limit=12.0 2023-06-25 12:08:34,594 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 12:08:35,058 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.54 vs. limit=22.5 2023-06-25 12:08:49,192 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1330692.0, ans=0.1 2023-06-25 12:09:01,447 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.50 vs. limit=15.0 2023-06-25 12:09:23,919 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.15 vs. limit=10.0 2023-06-25 12:09:39,014 INFO [train.py:996] (0/4) Epoch 8, batch 8350, loss[loss=0.2127, simple_loss=0.3013, pruned_loss=0.0621, over 21753.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2993, pruned_loss=0.06951, over 4260351.98 frames. 
], batch size: 282, lr: 3.79e-03, grad_scale: 16.0 2023-06-25 12:10:10,247 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1330932.0, ans=0.07 2023-06-25 12:10:29,767 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1330992.0, ans=0.125 2023-06-25 12:10:31,559 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1330992.0, ans=0.125 2023-06-25 12:10:38,472 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1330992.0, ans=0.0 2023-06-25 12:10:43,312 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1331052.0, ans=10.0 2023-06-25 12:10:48,681 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1331052.0, ans=0.125 2023-06-25 12:10:50,290 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1331052.0, ans=0.125 2023-06-25 12:11:07,571 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1331112.0, ans=0.125 2023-06-25 12:11:10,350 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.535e+02 3.482e+02 5.019e+02 7.188e+02 1.647e+03, threshold=1.004e+03, percent-clipped=15.0 2023-06-25 12:11:27,225 INFO [train.py:996] (0/4) Epoch 8, batch 8400, loss[loss=0.1746, simple_loss=0.2635, pruned_loss=0.04281, over 21484.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2958, pruned_loss=0.06691, over 4262222.33 frames. ], batch size: 212, lr: 3.79e-03, grad_scale: 32.0 2023-06-25 12:11:47,726 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1331232.0, ans=0.1 2023-06-25 12:13:15,386 INFO [train.py:996] (0/4) Epoch 8, batch 8450, loss[loss=0.1679, simple_loss=0.2047, pruned_loss=0.06553, over 20057.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2926, pruned_loss=0.06582, over 4273797.64 frames. ], batch size: 704, lr: 3.79e-03, grad_scale: 32.0 2023-06-25 12:13:19,575 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1331472.0, ans=0.0 2023-06-25 12:14:45,598 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.627e+02 3.841e+02 5.103e+02 7.112e+02 1.474e+03, threshold=1.021e+03, percent-clipped=11.0 2023-06-25 12:15:04,505 INFO [train.py:996] (0/4) Epoch 8, batch 8500, loss[loss=0.2427, simple_loss=0.2925, pruned_loss=0.09641, over 21365.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.2899, pruned_loss=0.06738, over 4275767.29 frames. ], batch size: 473, lr: 3.79e-03, grad_scale: 32.0 2023-06-25 12:15:11,129 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.62 vs. 
limit=15.0 2023-06-25 12:16:10,035 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1331952.0, ans=0.125 2023-06-25 12:16:22,766 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1331952.0, ans=0.125 2023-06-25 12:16:37,056 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1332012.0, ans=0.125 2023-06-25 12:16:56,659 INFO [train.py:996] (0/4) Epoch 8, batch 8550, loss[loss=0.2438, simple_loss=0.3258, pruned_loss=0.0809, over 21410.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2957, pruned_loss=0.07043, over 4276061.44 frames. ], batch size: 211, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:16:59,388 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1332072.0, ans=0.125 2023-06-25 12:17:24,086 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1332132.0, ans=0.0 2023-06-25 12:17:39,857 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1332132.0, ans=0.0 2023-06-25 12:18:36,643 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.783e+02 4.136e+02 5.316e+02 7.631e+02 1.468e+03, threshold=1.063e+03, percent-clipped=11.0 2023-06-25 12:18:44,400 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1332312.0, ans=0.1 2023-06-25 12:18:52,714 INFO [train.py:996] (0/4) Epoch 8, batch 8600, loss[loss=0.2669, simple_loss=0.3446, pruned_loss=0.0946, over 21533.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.3014, pruned_loss=0.07152, over 4274012.71 frames. ], batch size: 131, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:19:00,170 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1332372.0, ans=0.2 2023-06-25 12:19:03,751 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1332372.0, ans=0.125 2023-06-25 12:19:31,190 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.42 vs. limit=15.0 2023-06-25 12:19:39,838 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1332492.0, ans=0.125 2023-06-25 12:19:41,414 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1332492.0, ans=0.07 2023-06-25 12:20:17,837 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1332552.0, ans=0.0 2023-06-25 12:20:38,890 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1332612.0, ans=0.125 2023-06-25 12:20:43,319 INFO [train.py:996] (0/4) Epoch 8, batch 8650, loss[loss=0.2095, simple_loss=0.3106, pruned_loss=0.05421, over 21647.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.308, pruned_loss=0.07245, over 4272513.20 frames. 
], batch size: 441, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:21:23,581 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1332792.0, ans=0.125 2023-06-25 12:21:27,938 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.20 vs. limit=15.0 2023-06-25 12:21:41,088 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1332852.0, ans=0.0 2023-06-25 12:22:09,060 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.16 vs. limit=15.0 2023-06-25 12:22:16,162 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.514e+02 3.872e+02 5.286e+02 7.583e+02 1.337e+03, threshold=1.057e+03, percent-clipped=5.0 2023-06-25 12:22:32,429 INFO [train.py:996] (0/4) Epoch 8, batch 8700, loss[loss=0.2159, simple_loss=0.2748, pruned_loss=0.07855, over 21458.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2996, pruned_loss=0.06922, over 4277685.70 frames. ], batch size: 441, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:22:36,153 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1332972.0, ans=0.125 2023-06-25 12:23:44,430 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1333152.0, ans=0.0 2023-06-25 12:24:21,808 INFO [train.py:996] (0/4) Epoch 8, batch 8750, loss[loss=0.2178, simple_loss=0.2847, pruned_loss=0.07543, over 21865.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2959, pruned_loss=0.0699, over 4286830.38 frames. ], batch size: 351, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:24:35,183 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1333272.0, ans=0.1 2023-06-25 12:24:52,499 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1333332.0, ans=0.125 2023-06-25 12:25:10,278 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1333392.0, ans=0.1 2023-06-25 12:25:24,806 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1333452.0, ans=0.0 2023-06-25 12:26:02,041 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.799e+02 3.947e+02 5.629e+02 7.790e+02 1.713e+03, threshold=1.126e+03, percent-clipped=18.0 2023-06-25 12:26:06,199 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1333512.0, ans=0.125 2023-06-25 12:26:08,017 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1333512.0, ans=0.1 2023-06-25 12:26:18,058 INFO [train.py:996] (0/4) Epoch 8, batch 8800, loss[loss=0.2334, simple_loss=0.3111, pruned_loss=0.07783, over 21626.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.3025, pruned_loss=0.07261, over 4288396.38 frames. 
], batch size: 263, lr: 3.78e-03, grad_scale: 32.0 2023-06-25 12:26:25,939 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1333572.0, ans=0.0 2023-06-25 12:26:30,188 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.28 vs. limit=12.0 2023-06-25 12:26:35,467 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1333632.0, ans=0.2 2023-06-25 12:26:40,139 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=21.43 vs. limit=22.5 2023-06-25 12:26:46,729 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1333632.0, ans=0.125 2023-06-25 12:26:59,676 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.69 vs. limit=15.0 2023-06-25 12:27:07,109 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=6.56 vs. limit=22.5 2023-06-25 12:28:09,032 INFO [train.py:996] (0/4) Epoch 8, batch 8850, loss[loss=0.2235, simple_loss=0.2785, pruned_loss=0.08427, over 20119.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3098, pruned_loss=0.07457, over 4275260.18 frames. ], batch size: 702, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:28:19,823 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.31 vs. limit=6.0 2023-06-25 12:28:24,069 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1333872.0, ans=0.0 2023-06-25 12:29:11,028 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1333992.0, ans=0.2 2023-06-25 12:29:13,389 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1334052.0, ans=0.125 2023-06-25 12:29:51,865 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.618e+02 3.568e+02 4.882e+02 6.738e+02 2.080e+03, threshold=9.764e+02, percent-clipped=3.0 2023-06-25 12:29:52,993 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1334112.0, ans=0.0 2023-06-25 12:29:58,306 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1334112.0, ans=0.0 2023-06-25 12:30:01,478 INFO [train.py:996] (0/4) Epoch 8, batch 8900, loss[loss=0.1958, simple_loss=0.2636, pruned_loss=0.06395, over 21778.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.3029, pruned_loss=0.07316, over 4275771.66 frames. 
], batch size: 317, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:30:05,538 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1334172.0, ans=0.125 2023-06-25 12:30:11,393 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1334172.0, ans=0.0 2023-06-25 12:31:11,466 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1334352.0, ans=0.0 2023-06-25 12:31:59,188 INFO [train.py:996] (0/4) Epoch 8, batch 8950, loss[loss=0.2066, simple_loss=0.2716, pruned_loss=0.07078, over 21359.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.3031, pruned_loss=0.07213, over 4272349.86 frames. ], batch size: 194, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:32:01,919 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1334472.0, ans=0.0 2023-06-25 12:32:57,896 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1334592.0, ans=0.1 2023-06-25 12:33:08,070 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1334652.0, ans=0.2 2023-06-25 12:33:13,649 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1334652.0, ans=0.125 2023-06-25 12:33:25,734 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1334712.0, ans=0.125 2023-06-25 12:33:34,163 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.718e+02 4.076e+02 6.080e+02 7.762e+02 1.933e+03, threshold=1.216e+03, percent-clipped=14.0 2023-06-25 12:33:34,938 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1334712.0, ans=0.125 2023-06-25 12:33:48,754 INFO [train.py:996] (0/4) Epoch 8, batch 9000, loss[loss=0.212, simple_loss=0.2884, pruned_loss=0.06782, over 21603.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.2979, pruned_loss=0.07168, over 4262732.02 frames. ], batch size: 391, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:33:48,755 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-25 12:34:07,160 INFO [train.py:1028] (0/4) Epoch 8, validation: loss=0.2631, simple_loss=0.3554, pruned_loss=0.08544, over 1796401.00 frames. 2023-06-25 12:34:07,162 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23714MB 2023-06-25 12:34:11,548 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1334772.0, ans=0.0 2023-06-25 12:34:33,102 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1334832.0, ans=0.1 2023-06-25 12:35:23,121 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1334952.0, ans=0.2 2023-06-25 12:35:57,398 INFO [train.py:996] (0/4) Epoch 8, batch 9050, loss[loss=0.2131, simple_loss=0.2948, pruned_loss=0.06566, over 21775.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2946, pruned_loss=0.06944, over 4258201.88 frames. 
], batch size: 282, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:36:05,768 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1335072.0, ans=0.035 2023-06-25 12:36:13,251 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1335072.0, ans=0.0 2023-06-25 12:36:42,238 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1335132.0, ans=0.1 2023-06-25 12:36:53,545 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1335192.0, ans=0.0 2023-06-25 12:37:01,167 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.53 vs. limit=15.0 2023-06-25 12:37:09,714 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1335252.0, ans=0.125 2023-06-25 12:37:47,100 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.667e+02 3.976e+02 5.366e+02 7.574e+02 1.688e+03, threshold=1.073e+03, percent-clipped=5.0 2023-06-25 12:37:55,895 INFO [train.py:996] (0/4) Epoch 8, batch 9100, loss[loss=0.262, simple_loss=0.3408, pruned_loss=0.09164, over 21319.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2982, pruned_loss=0.07145, over 4257489.80 frames. ], batch size: 549, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:38:16,453 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.72 vs. limit=15.0 2023-06-25 12:38:53,301 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1335492.0, ans=0.04949747468305833 2023-06-25 12:38:53,357 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1335492.0, ans=0.2 2023-06-25 12:39:47,142 INFO [train.py:996] (0/4) Epoch 8, batch 9150, loss[loss=0.2829, simple_loss=0.3701, pruned_loss=0.09785, over 21633.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.3033, pruned_loss=0.07015, over 4254214.22 frames. ], batch size: 441, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:40:12,461 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.31 vs. limit=15.0 2023-06-25 12:41:26,801 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1335912.0, ans=0.0 2023-06-25 12:41:27,974 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.600e+02 3.582e+02 4.285e+02 5.759e+02 1.145e+03, threshold=8.570e+02, percent-clipped=4.0 2023-06-25 12:41:47,526 INFO [train.py:996] (0/4) Epoch 8, batch 9200, loss[loss=0.2698, simple_loss=0.355, pruned_loss=0.0923, over 21610.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.3073, pruned_loss=0.07002, over 4262033.19 frames. ], batch size: 414, lr: 3.78e-03, grad_scale: 32.0 2023-06-25 12:41:59,270 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.45 vs. 
limit=15.0 2023-06-25 12:42:04,136 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1336032.0, ans=0.125 2023-06-25 12:42:05,805 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1336032.0, ans=0.0 2023-06-25 12:42:37,582 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1336092.0, ans=0.025 2023-06-25 12:42:50,212 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1336152.0, ans=0.125 2023-06-25 12:43:08,093 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=1336212.0, ans=10.0 2023-06-25 12:43:37,074 INFO [train.py:996] (0/4) Epoch 8, batch 9250, loss[loss=0.2281, simple_loss=0.2987, pruned_loss=0.07877, over 21195.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.3092, pruned_loss=0.07291, over 4256339.24 frames. ], batch size: 143, lr: 3.78e-03, grad_scale: 32.0 2023-06-25 12:44:41,079 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1336452.0, ans=10.0 2023-06-25 12:44:51,733 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1336452.0, ans=0.125 2023-06-25 12:45:21,452 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.801e+02 3.683e+02 5.339e+02 7.868e+02 1.539e+03, threshold=1.068e+03, percent-clipped=20.0 2023-06-25 12:45:28,201 INFO [train.py:996] (0/4) Epoch 8, batch 9300, loss[loss=0.1897, simple_loss=0.2569, pruned_loss=0.06129, over 21498.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.3029, pruned_loss=0.07247, over 4257224.60 frames. ], batch size: 230, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:46:00,763 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1336632.0, ans=0.125 2023-06-25 12:46:03,091 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.99 vs. limit=15.0 2023-06-25 12:47:19,154 INFO [train.py:996] (0/4) Epoch 8, batch 9350, loss[loss=0.2275, simple_loss=0.3158, pruned_loss=0.06964, over 21412.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.3081, pruned_loss=0.07325, over 4262296.87 frames. ], batch size: 131, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:47:50,419 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1336932.0, ans=0.0 2023-06-25 12:49:02,856 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.027e+02 4.116e+02 5.791e+02 8.209e+02 2.175e+03, threshold=1.158e+03, percent-clipped=13.0 2023-06-25 12:49:10,235 INFO [train.py:996] (0/4) Epoch 8, batch 9400, loss[loss=0.2242, simple_loss=0.2955, pruned_loss=0.07645, over 20158.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3101, pruned_loss=0.07495, over 4262502.62 frames. 
], batch size: 702, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:49:10,860 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1337172.0, ans=0.1 2023-06-25 12:49:21,891 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1337172.0, ans=0.125 2023-06-25 12:49:22,373 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.89 vs. limit=22.5 2023-06-25 12:50:28,839 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1337352.0, ans=0.125 2023-06-25 12:50:35,048 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.37 vs. limit=15.0 2023-06-25 12:51:05,929 INFO [train.py:996] (0/4) Epoch 8, batch 9450, loss[loss=0.2093, simple_loss=0.2679, pruned_loss=0.07538, over 21868.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.3005, pruned_loss=0.073, over 4264757.86 frames. ], batch size: 373, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:51:21,831 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1337532.0, ans=0.05 2023-06-25 12:51:26,086 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=1337532.0, ans=6.0 2023-06-25 12:51:34,719 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 12:52:24,780 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1337652.0, ans=0.1 2023-06-25 12:52:41,865 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.789e+02 4.276e+02 5.565e+02 7.806e+02 1.820e+03, threshold=1.113e+03, percent-clipped=7.0 2023-06-25 12:52:44,805 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.60 vs. limit=10.0 2023-06-25 12:52:48,839 INFO [train.py:996] (0/4) Epoch 8, batch 9500, loss[loss=0.2229, simple_loss=0.2846, pruned_loss=0.08056, over 21850.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2943, pruned_loss=0.07096, over 4259118.07 frames. 
], batch size: 107, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:53:32,037 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1337832.0, ans=0.0 2023-06-25 12:53:32,095 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1337832.0, ans=0.125 2023-06-25 12:54:01,131 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1337892.0, ans=0.0 2023-06-25 12:54:13,461 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1337952.0, ans=0.0 2023-06-25 12:54:18,365 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1337952.0, ans=0.2 2023-06-25 12:54:25,061 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1338012.0, ans=0.025 2023-06-25 12:54:30,457 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1338012.0, ans=0.0 2023-06-25 12:54:43,660 INFO [train.py:996] (0/4) Epoch 8, batch 9550, loss[loss=0.2473, simple_loss=0.3253, pruned_loss=0.08466, over 21748.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.298, pruned_loss=0.07252, over 4255955.19 frames. ], batch size: 332, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:54:56,806 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1338072.0, ans=0.125 2023-06-25 12:55:32,762 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.04 vs. limit=22.5 2023-06-25 12:56:26,044 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.967e+02 4.048e+02 5.374e+02 8.215e+02 1.903e+03, threshold=1.075e+03, percent-clipped=10.0 2023-06-25 12:56:32,896 INFO [train.py:996] (0/4) Epoch 8, batch 9600, loss[loss=0.2181, simple_loss=0.2787, pruned_loss=0.07872, over 21598.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.3001, pruned_loss=0.07386, over 4263109.39 frames. ], batch size: 548, lr: 3.78e-03, grad_scale: 32.0 2023-06-25 12:57:20,265 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1338492.0, ans=0.125 2023-06-25 12:57:29,132 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1338492.0, ans=0.0 2023-06-25 12:58:24,371 INFO [train.py:996] (0/4) Epoch 8, batch 9650, loss[loss=0.2275, simple_loss=0.3019, pruned_loss=0.07653, over 21733.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.3003, pruned_loss=0.0737, over 4267683.62 frames. 
], batch size: 298, lr: 3.78e-03, grad_scale: 32.0 2023-06-25 12:58:41,046 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 12:58:47,881 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1338732.0, ans=0.2 2023-06-25 12:58:53,649 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1338732.0, ans=0.125 2023-06-25 12:59:23,176 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.94 vs. limit=22.5 2023-06-25 12:59:38,374 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1338852.0, ans=0.5 2023-06-25 13:00:07,428 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.726e+02 3.684e+02 4.580e+02 6.595e+02 1.807e+03, threshold=9.160e+02, percent-clipped=4.0 2023-06-25 13:00:20,064 INFO [train.py:996] (0/4) Epoch 8, batch 9700, loss[loss=0.2159, simple_loss=0.2897, pruned_loss=0.07103, over 21791.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.3045, pruned_loss=0.07469, over 4271546.39 frames. ], batch size: 124, lr: 3.78e-03, grad_scale: 32.0 2023-06-25 13:00:34,332 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1338972.0, ans=0.07 2023-06-25 13:02:02,392 INFO [train.py:996] (0/4) Epoch 8, batch 9750, loss[loss=0.2655, simple_loss=0.3615, pruned_loss=0.08479, over 21833.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.2991, pruned_loss=0.07311, over 4273225.60 frames. ], batch size: 118, lr: 3.77e-03, grad_scale: 32.0 2023-06-25 13:02:15,387 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1339272.0, ans=0.125 2023-06-25 13:02:43,818 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.70 vs. limit=15.0 2023-06-25 13:02:59,929 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1339392.0, ans=0.125 2023-06-25 13:03:12,393 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1339452.0, ans=0.125 2023-06-25 13:03:42,355 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.775e+02 3.743e+02 5.532e+02 7.768e+02 2.224e+03, threshold=1.106e+03, percent-clipped=14.0 2023-06-25 13:03:49,315 INFO [train.py:996] (0/4) Epoch 8, batch 9800, loss[loss=0.2166, simple_loss=0.2888, pruned_loss=0.07224, over 21656.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.298, pruned_loss=0.07275, over 4276799.96 frames. 
], batch size: 230, lr: 3.77e-03, grad_scale: 32.0 2023-06-25 13:04:24,728 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1339632.0, ans=0.125 2023-06-25 13:04:29,861 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1339692.0, ans=0.2 2023-06-25 13:04:42,241 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1339692.0, ans=0.2 2023-06-25 13:05:22,309 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1339812.0, ans=0.125 2023-06-25 13:05:22,329 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1339812.0, ans=0.125 2023-06-25 13:05:37,669 INFO [train.py:996] (0/4) Epoch 8, batch 9850, loss[loss=0.2115, simple_loss=0.3059, pruned_loss=0.05852, over 16029.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2946, pruned_loss=0.07319, over 4277515.71 frames. ], batch size: 60, lr: 3.77e-03, grad_scale: 32.0 2023-06-25 13:05:38,271 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1339872.0, ans=0.125 2023-06-25 13:06:55,756 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1340052.0, ans=0.125 2023-06-25 13:07:19,913 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.843e+02 3.728e+02 4.692e+02 6.683e+02 1.521e+03, threshold=9.384e+02, percent-clipped=6.0 2023-06-25 13:07:26,595 INFO [train.py:996] (0/4) Epoch 8, batch 9900, loss[loss=0.2725, simple_loss=0.3359, pruned_loss=0.1046, over 21376.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2911, pruned_loss=0.07257, over 4269546.53 frames. ], batch size: 471, lr: 3.77e-03, grad_scale: 32.0 2023-06-25 13:07:37,588 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1340172.0, ans=0.125 2023-06-25 13:09:14,630 INFO [train.py:996] (0/4) Epoch 8, batch 9950, loss[loss=0.2232, simple_loss=0.2863, pruned_loss=0.0801, over 21705.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.2953, pruned_loss=0.07469, over 4256730.20 frames. ], batch size: 112, lr: 3.77e-03, grad_scale: 16.0 2023-06-25 13:10:26,774 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1340652.0, ans=0.1 2023-06-25 13:10:46,657 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1340712.0, ans=0.2 2023-06-25 13:10:58,853 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1340712.0, ans=0.2 2023-06-25 13:10:59,875 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.606e+02 3.700e+02 4.924e+02 7.179e+02 1.701e+03, threshold=9.849e+02, percent-clipped=16.0 2023-06-25 13:11:11,636 INFO [train.py:996] (0/4) Epoch 8, batch 10000, loss[loss=0.2123, simple_loss=0.2795, pruned_loss=0.07256, over 20003.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2912, pruned_loss=0.07327, over 4259867.42 frames. 
], batch size: 703, lr: 3.77e-03, grad_scale: 32.0 2023-06-25 13:11:54,779 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1340832.0, ans=0.5 2023-06-25 13:12:01,938 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1340892.0, ans=0.125 2023-06-25 13:12:19,175 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.35 vs. limit=15.0 2023-06-25 13:12:30,280 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1340952.0, ans=0.125 2023-06-25 13:12:39,379 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1341012.0, ans=0.125 2023-06-25 13:13:02,334 INFO [train.py:996] (0/4) Epoch 8, batch 10050, loss[loss=0.1728, simple_loss=0.2565, pruned_loss=0.04453, over 21745.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2922, pruned_loss=0.07311, over 4260424.29 frames. ], batch size: 282, lr: 3.77e-03, grad_scale: 32.0 2023-06-25 13:13:34,319 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1341132.0, ans=0.0 2023-06-25 13:13:36,367 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1341132.0, ans=0.0 2023-06-25 13:13:41,507 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1341132.0, ans=0.05 2023-06-25 13:13:47,501 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.27 vs. limit=10.0 2023-06-25 13:14:25,719 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1341252.0, ans=0.2 2023-06-25 13:14:42,007 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1341312.0, ans=0.125 2023-06-25 13:14:47,098 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1341312.0, ans=0.0 2023-06-25 13:14:55,225 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.655e+02 4.346e+02 5.951e+02 7.848e+02 1.633e+03, threshold=1.190e+03, percent-clipped=16.0 2023-06-25 13:14:58,758 INFO [train.py:996] (0/4) Epoch 8, batch 10100, loss[loss=0.2247, simple_loss=0.309, pruned_loss=0.07018, over 21651.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2875, pruned_loss=0.0704, over 4261075.27 frames. ], batch size: 414, lr: 3.77e-03, grad_scale: 16.0 2023-06-25 13:15:05,526 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.41 vs. 
limit=15.0 2023-06-25 13:15:13,399 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1341372.0, ans=0.125 2023-06-25 13:15:27,327 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1341432.0, ans=0.125 2023-06-25 13:15:29,199 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1341432.0, ans=0.0 2023-06-25 13:15:32,567 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1341432.0, ans=0.125 2023-06-25 13:16:48,276 INFO [train.py:996] (0/4) Epoch 8, batch 10150, loss[loss=0.2381, simple_loss=0.3115, pruned_loss=0.08233, over 21634.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2938, pruned_loss=0.07282, over 4261156.06 frames. ], batch size: 441, lr: 3.77e-03, grad_scale: 16.0 2023-06-25 13:17:03,283 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.03 vs. limit=15.0 2023-06-25 13:17:30,379 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1341732.0, ans=0.125 2023-06-25 13:17:46,340 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1341792.0, ans=0.05 2023-06-25 13:18:13,354 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.50 vs. limit=10.0 2023-06-25 13:18:38,992 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.326e+02 3.458e+02 4.384e+02 5.388e+02 1.096e+03, threshold=8.768e+02, percent-clipped=0.0 2023-06-25 13:18:42,758 INFO [train.py:996] (0/4) Epoch 8, batch 10200, loss[loss=0.2054, simple_loss=0.2932, pruned_loss=0.05879, over 21724.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2934, pruned_loss=0.07154, over 4266940.15 frames. ], batch size: 351, lr: 3.77e-03, grad_scale: 16.0 2023-06-25 13:18:58,331 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.25 vs. limit=15.0 2023-06-25 13:18:59,593 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1342032.0, ans=0.0 2023-06-25 13:19:28,435 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.84 vs. limit=12.0 2023-06-25 13:20:09,493 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1342212.0, ans=0.0 2023-06-25 13:20:10,135 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.97 vs. limit=15.0 2023-06-25 13:20:34,631 INFO [train.py:996] (0/4) Epoch 8, batch 10250, loss[loss=0.1888, simple_loss=0.2699, pruned_loss=0.05386, over 21621.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2893, pruned_loss=0.06655, over 4275871.83 frames. 
], batch size: 263, lr: 3.77e-03, grad_scale: 16.0 2023-06-25 13:20:44,547 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1342272.0, ans=0.0 2023-06-25 13:20:48,009 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1342272.0, ans=0.0 2023-06-25 13:21:51,719 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1342452.0, ans=0.0 2023-06-25 13:21:53,916 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.98 vs. limit=15.0 2023-06-25 13:21:56,985 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1342452.0, ans=0.05 2023-06-25 13:22:13,055 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1342512.0, ans=0.0 2023-06-25 13:22:23,176 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.001e+02 3.628e+02 5.027e+02 6.947e+02 1.354e+03, threshold=1.005e+03, percent-clipped=10.0 2023-06-25 13:22:26,730 INFO [train.py:996] (0/4) Epoch 8, batch 10300, loss[loss=0.225, simple_loss=0.3054, pruned_loss=0.07232, over 21259.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2917, pruned_loss=0.06743, over 4272426.80 frames. ], batch size: 176, lr: 3.77e-03, grad_scale: 16.0 2023-06-25 13:23:27,748 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1342692.0, ans=0.0 2023-06-25 13:23:58,121 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.46 vs. limit=22.5 2023-06-25 13:24:18,485 INFO [train.py:996] (0/4) Epoch 8, batch 10350, loss[loss=0.2184, simple_loss=0.3049, pruned_loss=0.06591, over 21598.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.293, pruned_loss=0.06738, over 4270818.18 frames. ], batch size: 389, lr: 3.77e-03, grad_scale: 16.0 2023-06-25 13:24:40,693 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1342932.0, ans=0.125 2023-06-25 13:24:44,054 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 13:25:31,977 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1343052.0, ans=0.125 2023-06-25 13:26:05,203 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.961e+02 4.465e+02 6.325e+02 1.027e+03 2.051e+03, threshold=1.265e+03, percent-clipped=26.0 2023-06-25 13:26:05,955 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1343112.0, ans=0.125 2023-06-25 13:26:15,284 INFO [train.py:996] (0/4) Epoch 8, batch 10400, loss[loss=0.2193, simple_loss=0.2915, pruned_loss=0.07355, over 21728.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2879, pruned_loss=0.06691, over 4276646.78 frames. 
], batch size: 391, lr: 3.77e-03, grad_scale: 32.0 2023-06-25 13:26:19,371 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1343172.0, ans=0.125 2023-06-25 13:26:31,662 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1343232.0, ans=0.125 2023-06-25 13:28:06,159 INFO [train.py:996] (0/4) Epoch 8, batch 10450, loss[loss=0.2114, simple_loss=0.2866, pruned_loss=0.06811, over 21409.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2922, pruned_loss=0.06982, over 4270898.35 frames. ], batch size: 131, lr: 3.77e-03, grad_scale: 32.0 2023-06-25 13:28:43,317 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1343532.0, ans=0.0 2023-06-25 13:29:52,718 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.857e+02 4.046e+02 6.081e+02 8.924e+02 1.860e+03, threshold=1.216e+03, percent-clipped=7.0 2023-06-25 13:29:54,320 INFO [train.py:996] (0/4) Epoch 8, batch 10500, loss[loss=0.2201, simple_loss=0.2806, pruned_loss=0.07977, over 21179.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2919, pruned_loss=0.06921, over 4270237.61 frames. ], batch size: 143, lr: 3.77e-03, grad_scale: 16.0 2023-06-25 13:29:58,489 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1343772.0, ans=0.0 2023-06-25 13:30:49,334 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1343892.0, ans=0.2 2023-06-25 13:31:14,827 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-224000.pt 2023-06-25 13:31:44,417 INFO [train.py:996] (0/4) Epoch 8, batch 10550, loss[loss=0.1991, simple_loss=0.265, pruned_loss=0.06659, over 21760.00 frames. ], tot_loss[loss=0.212, simple_loss=0.2867, pruned_loss=0.06862, over 4273098.53 frames. ], batch size: 124, lr: 3.77e-03, grad_scale: 16.0 2023-06-25 13:31:52,125 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1344072.0, ans=0.0 2023-06-25 13:32:22,411 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1344132.0, ans=0.125 2023-06-25 13:32:30,184 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1344132.0, ans=0.125 2023-06-25 13:32:39,464 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1344192.0, ans=0.125 2023-06-25 13:33:29,346 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.13 vs. limit=22.5 2023-06-25 13:33:35,044 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.430e+02 3.879e+02 5.008e+02 7.044e+02 1.478e+03, threshold=1.002e+03, percent-clipped=2.0 2023-06-25 13:33:37,138 INFO [train.py:996] (0/4) Epoch 8, batch 10600, loss[loss=0.2216, simple_loss=0.3151, pruned_loss=0.06412, over 21477.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.283, pruned_loss=0.06729, over 4268876.98 frames. 
], batch size: 471, lr: 3.77e-03, grad_scale: 16.0 2023-06-25 13:34:02,259 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1344372.0, ans=0.125 2023-06-25 13:34:14,535 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1344432.0, ans=0.125 2023-06-25 13:34:36,138 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1344492.0, ans=0.0 2023-06-25 13:35:16,959 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1344612.0, ans=0.125 2023-06-25 13:35:34,416 INFO [train.py:996] (0/4) Epoch 8, batch 10650, loss[loss=0.1807, simple_loss=0.2648, pruned_loss=0.04828, over 21747.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2859, pruned_loss=0.0662, over 4272154.74 frames. ], batch size: 332, lr: 3.77e-03, grad_scale: 16.0 2023-06-25 13:35:52,750 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1344672.0, ans=0.125 2023-06-25 13:36:22,154 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1344792.0, ans=0.5 2023-06-25 13:37:23,245 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.596e+02 3.773e+02 5.055e+02 6.605e+02 1.042e+03, threshold=1.011e+03, percent-clipped=1.0 2023-06-25 13:37:30,131 INFO [train.py:996] (0/4) Epoch 8, batch 10700, loss[loss=0.272, simple_loss=0.3402, pruned_loss=0.1019, over 21402.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2838, pruned_loss=0.06561, over 4266746.19 frames. ], batch size: 471, lr: 3.77e-03, grad_scale: 16.0 2023-06-25 13:37:30,703 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1344972.0, ans=0.125 2023-06-25 13:37:34,023 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1344972.0, ans=0.0 2023-06-25 13:38:19,416 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.30 vs. limit=15.0 2023-06-25 13:39:22,496 INFO [train.py:996] (0/4) Epoch 8, batch 10750, loss[loss=0.2322, simple_loss=0.3156, pruned_loss=0.07446, over 21415.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2937, pruned_loss=0.06955, over 4269514.70 frames. ], batch size: 131, lr: 3.77e-03, grad_scale: 16.0 2023-06-25 13:40:33,978 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.84 vs. limit=22.5 2023-06-25 13:40:58,214 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.82 vs. limit=15.0 2023-06-25 13:41:05,930 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.724e+02 3.874e+02 4.652e+02 6.783e+02 1.933e+03, threshold=9.304e+02, percent-clipped=9.0 2023-06-25 13:41:08,315 INFO [train.py:996] (0/4) Epoch 8, batch 10800, loss[loss=0.2595, simple_loss=0.3347, pruned_loss=0.09217, over 21680.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2995, pruned_loss=0.07024, over 4270432.86 frames. 
], batch size: 351, lr: 3.77e-03, grad_scale: 32.0 2023-06-25 13:41:25,358 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1345632.0, ans=0.125 2023-06-25 13:41:36,095 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1345632.0, ans=0.0 2023-06-25 13:41:39,456 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1345632.0, ans=0.1 2023-06-25 13:42:08,409 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1345692.0, ans=0.125 2023-06-25 13:42:53,471 INFO [train.py:996] (0/4) Epoch 8, batch 10850, loss[loss=0.2258, simple_loss=0.2877, pruned_loss=0.0819, over 21306.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.3, pruned_loss=0.07047, over 4272437.39 frames. ], batch size: 144, lr: 3.77e-03, grad_scale: 16.0 2023-06-25 13:42:57,757 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1345872.0, ans=0.0 2023-06-25 13:43:03,385 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1345872.0, ans=0.95 2023-06-25 13:44:33,206 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1346112.0, ans=0.0 2023-06-25 13:44:43,293 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.620e+02 4.154e+02 5.827e+02 8.227e+02 1.341e+03, threshold=1.165e+03, percent-clipped=17.0 2023-06-25 13:44:43,325 INFO [train.py:996] (0/4) Epoch 8, batch 10900, loss[loss=0.1908, simple_loss=0.2827, pruned_loss=0.04944, over 21703.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2923, pruned_loss=0.06875, over 4272582.81 frames. ], batch size: 247, lr: 3.77e-03, grad_scale: 16.0 2023-06-25 13:45:51,432 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1346352.0, ans=0.125 2023-06-25 13:46:02,098 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1346352.0, ans=0.2 2023-06-25 13:46:27,399 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1346472.0, ans=0.1 2023-06-25 13:46:28,367 INFO [train.py:996] (0/4) Epoch 8, batch 10950, loss[loss=0.2437, simple_loss=0.2897, pruned_loss=0.09884, over 21278.00 frames. ], tot_loss[loss=0.212, simple_loss=0.2891, pruned_loss=0.06741, over 4264575.82 frames. ], batch size: 507, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 13:46:43,759 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.96 vs. limit=15.0 2023-06-25 13:47:39,860 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1346652.0, ans=0.1 2023-06-25 13:47:59,444 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.09 vs. 
limit=10.0 2023-06-25 13:48:10,259 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.690e+02 3.805e+02 5.172e+02 7.672e+02 1.562e+03, threshold=1.034e+03, percent-clipped=4.0 2023-06-25 13:48:10,290 INFO [train.py:996] (0/4) Epoch 8, batch 11000, loss[loss=0.2253, simple_loss=0.2942, pruned_loss=0.07822, over 21884.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2877, pruned_loss=0.06784, over 4266088.50 frames. ], batch size: 414, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 13:48:19,396 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1346772.0, ans=0.2 2023-06-25 13:48:35,548 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1346832.0, ans=0.0 2023-06-25 13:48:37,663 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.08 vs. limit=22.5 2023-06-25 13:48:42,513 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1346832.0, ans=0.0 2023-06-25 13:49:14,167 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1346892.0, ans=0.0 2023-06-25 13:49:24,964 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1346952.0, ans=0.125 2023-06-25 13:49:27,071 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1346952.0, ans=0.0 2023-06-25 13:49:30,520 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1346952.0, ans=0.125 2023-06-25 13:49:40,967 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1347012.0, ans=0.0 2023-06-25 13:49:56,432 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1347012.0, ans=0.2 2023-06-25 13:49:59,325 INFO [train.py:996] (0/4) Epoch 8, batch 11050, loss[loss=0.1988, simple_loss=0.2631, pruned_loss=0.06724, over 21489.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2845, pruned_loss=0.06843, over 4274936.21 frames. ], batch size: 195, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 13:50:02,322 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.75 vs. 
limit=10.0 2023-06-25 13:50:05,272 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1347072.0, ans=0.0 2023-06-25 13:50:55,512 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1347192.0, ans=0.2 2023-06-25 13:51:24,537 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1347252.0, ans=0.1 2023-06-25 13:51:29,690 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1347252.0, ans=0.0 2023-06-25 13:51:50,003 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.876e+02 3.834e+02 4.608e+02 6.864e+02 1.206e+03, threshold=9.217e+02, percent-clipped=3.0 2023-06-25 13:51:50,048 INFO [train.py:996] (0/4) Epoch 8, batch 11100, loss[loss=0.2195, simple_loss=0.278, pruned_loss=0.08048, over 20062.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2841, pruned_loss=0.06874, over 4279434.41 frames. ], batch size: 703, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 13:51:52,312 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1347372.0, ans=0.0 2023-06-25 13:53:05,314 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1347552.0, ans=0.125 2023-06-25 13:53:15,418 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1347552.0, ans=0.0 2023-06-25 13:53:39,308 INFO [train.py:996] (0/4) Epoch 8, batch 11150, loss[loss=0.2107, simple_loss=0.3032, pruned_loss=0.05914, over 21802.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2829, pruned_loss=0.0687, over 4272104.94 frames. ], batch size: 317, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 13:53:39,875 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1347672.0, ans=0.0 2023-06-25 13:53:45,203 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1347672.0, ans=0.0 2023-06-25 13:54:11,787 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1347732.0, ans=0.2 2023-06-25 13:55:22,811 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.920e+02 3.496e+02 4.428e+02 6.433e+02 1.139e+03, threshold=8.857e+02, percent-clipped=2.0 2023-06-25 13:55:22,858 INFO [train.py:996] (0/4) Epoch 8, batch 11200, loss[loss=0.2025, simple_loss=0.2636, pruned_loss=0.07069, over 21384.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2819, pruned_loss=0.06877, over 4262680.56 frames. ], batch size: 212, lr: 3.76e-03, grad_scale: 32.0 2023-06-25 13:56:18,285 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1348092.0, ans=0.125 2023-06-25 13:56:30,338 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1348152.0, ans=0.1 2023-06-25 13:57:10,460 INFO [train.py:996] (0/4) Epoch 8, batch 11250, loss[loss=0.2496, simple_loss=0.2972, pruned_loss=0.101, over 21433.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2818, pruned_loss=0.06844, over 4265180.28 frames. 
], batch size: 509, lr: 3.76e-03, grad_scale: 32.0 2023-06-25 13:57:12,888 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1348272.0, ans=0.125 2023-06-25 13:57:20,045 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1348272.0, ans=0.0 2023-06-25 13:58:00,173 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1348392.0, ans=0.125 2023-06-25 13:58:59,631 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.719e+02 3.512e+02 4.278e+02 5.867e+02 1.075e+03, threshold=8.556e+02, percent-clipped=3.0 2023-06-25 13:58:59,663 INFO [train.py:996] (0/4) Epoch 8, batch 11300, loss[loss=0.1869, simple_loss=0.2636, pruned_loss=0.05508, over 21453.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2832, pruned_loss=0.0685, over 4272712.38 frames. ], batch size: 211, lr: 3.76e-03, grad_scale: 32.0 2023-06-25 13:59:52,878 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1348692.0, ans=0.1 2023-06-25 14:00:49,644 INFO [train.py:996] (0/4) Epoch 8, batch 11350, loss[loss=0.2456, simple_loss=0.3216, pruned_loss=0.08481, over 21746.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2851, pruned_loss=0.06806, over 4269197.06 frames. ], batch size: 124, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 14:01:00,843 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1348872.0, ans=0.0 2023-06-25 14:01:30,418 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.81 vs. limit=10.0 2023-06-25 14:01:33,283 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1348932.0, ans=0.125 2023-06-25 14:01:41,315 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.78 vs. limit=15.0 2023-06-25 14:01:55,164 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1348992.0, ans=0.2 2023-06-25 14:02:24,078 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=1349112.0, ans=10.0 2023-06-25 14:02:33,742 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.37 vs. limit=15.0 2023-06-25 14:02:41,717 INFO [train.py:996] (0/4) Epoch 8, batch 11400, loss[loss=0.2634, simple_loss=0.3379, pruned_loss=0.09447, over 21703.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2914, pruned_loss=0.07032, over 4271042.69 frames. ], batch size: 441, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 14:02:43,599 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.820e+02 3.968e+02 4.967e+02 6.707e+02 2.156e+03, threshold=9.935e+02, percent-clipped=13.0 2023-06-25 14:02:45,075 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.15 vs. 
limit=15.0 2023-06-25 14:02:56,630 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 14:03:10,650 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1349232.0, ans=0.125 2023-06-25 14:03:14,334 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1349232.0, ans=0.2 2023-06-25 14:03:47,853 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1349292.0, ans=0.1 2023-06-25 14:04:01,826 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1349352.0, ans=0.1 2023-06-25 14:04:24,551 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1349412.0, ans=0.0 2023-06-25 14:04:36,654 INFO [train.py:996] (0/4) Epoch 8, batch 11450, loss[loss=0.2449, simple_loss=0.3289, pruned_loss=0.08047, over 21600.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2913, pruned_loss=0.06868, over 4272900.64 frames. ], batch size: 414, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 14:04:48,109 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1349472.0, ans=0.125 2023-06-25 14:05:35,389 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1349592.0, ans=0.125 2023-06-25 14:06:09,082 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1349712.0, ans=0.125 2023-06-25 14:06:33,212 INFO [train.py:996] (0/4) Epoch 8, batch 11500, loss[loss=0.1797, simple_loss=0.2697, pruned_loss=0.04485, over 21289.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2952, pruned_loss=0.07001, over 4274474.99 frames. ], batch size: 176, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 14:06:34,725 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.573e+02 4.073e+02 4.904e+02 7.356e+02 1.531e+03, threshold=9.808e+02, percent-clipped=13.0 2023-06-25 14:07:04,119 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1349832.0, ans=0.2 2023-06-25 14:07:07,538 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1349832.0, ans=0.2 2023-06-25 14:07:54,479 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.64 vs. limit=15.0 2023-06-25 14:08:17,288 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1350012.0, ans=0.0 2023-06-25 14:08:30,982 INFO [train.py:996] (0/4) Epoch 8, batch 11550, loss[loss=0.2348, simple_loss=0.3361, pruned_loss=0.06679, over 21220.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2998, pruned_loss=0.06988, over 4270913.47 frames. ], batch size: 548, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 14:10:22,645 INFO [train.py:996] (0/4) Epoch 8, batch 11600, loss[loss=0.1928, simple_loss=0.2615, pruned_loss=0.06204, over 20710.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3134, pruned_loss=0.07177, over 4265976.53 frames. 
], batch size: 607, lr: 3.76e-03, grad_scale: 32.0 2023-06-25 14:10:24,362 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.914e+02 4.338e+02 5.534e+02 7.509e+02 2.145e+03, threshold=1.107e+03, percent-clipped=20.0 2023-06-25 14:11:15,343 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.96 vs. limit=6.0 2023-06-25 14:12:12,229 INFO [train.py:996] (0/4) Epoch 8, batch 11650, loss[loss=0.2086, simple_loss=0.2885, pruned_loss=0.06435, over 21224.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3175, pruned_loss=0.07273, over 4266445.09 frames. ], batch size: 176, lr: 3.76e-03, grad_scale: 32.0 2023-06-25 14:12:50,831 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1350792.0, ans=0.035 2023-06-25 14:13:37,802 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.39 vs. limit=15.0 2023-06-25 14:13:55,111 INFO [train.py:996] (0/4) Epoch 8, batch 11700, loss[loss=0.1936, simple_loss=0.2604, pruned_loss=0.06343, over 21481.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3092, pruned_loss=0.07169, over 4252800.29 frames. ], batch size: 212, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 14:13:58,337 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.769e+02 3.697e+02 5.318e+02 8.205e+02 1.649e+03, threshold=1.064e+03, percent-clipped=10.0 2023-06-25 14:14:21,948 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1351032.0, ans=0.1 2023-06-25 14:14:53,504 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1351092.0, ans=0.0 2023-06-25 14:15:43,633 INFO [train.py:996] (0/4) Epoch 8, batch 11750, loss[loss=0.2287, simple_loss=0.3031, pruned_loss=0.07715, over 21722.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.3007, pruned_loss=0.07149, over 4263563.10 frames. ], batch size: 351, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 14:16:14,615 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1351332.0, ans=0.125 2023-06-25 14:16:16,733 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1351332.0, ans=0.125 2023-06-25 14:16:51,473 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1351392.0, ans=0.125 2023-06-25 14:16:58,467 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1351452.0, ans=0.05 2023-06-25 14:17:40,838 INFO [train.py:996] (0/4) Epoch 8, batch 11800, loss[loss=0.2285, simple_loss=0.3162, pruned_loss=0.07037, over 21448.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.3024, pruned_loss=0.07366, over 4258124.16 frames. 
], batch size: 211, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 14:17:44,259 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.704e+02 3.917e+02 5.538e+02 7.967e+02 1.804e+03, threshold=1.108e+03, percent-clipped=14.0 2023-06-25 14:17:52,019 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1351572.0, ans=0.125 2023-06-25 14:19:16,454 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1351812.0, ans=0.125 2023-06-25 14:19:30,740 INFO [train.py:996] (0/4) Epoch 8, batch 11850, loss[loss=0.283, simple_loss=0.3554, pruned_loss=0.1053, over 21547.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.3042, pruned_loss=0.07269, over 4262328.50 frames. ], batch size: 507, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 14:19:31,563 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1351872.0, ans=0.04949747468305833 2023-06-25 14:19:41,811 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 14:20:25,970 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.37 vs. limit=15.0 2023-06-25 14:20:27,313 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1351992.0, ans=0.125 2023-06-25 14:20:57,394 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1352112.0, ans=0.2 2023-06-25 14:20:57,469 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1352112.0, ans=0.125 2023-06-25 14:21:19,380 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 14:21:22,244 INFO [train.py:996] (0/4) Epoch 8, batch 11900, loss[loss=0.2401, simple_loss=0.3256, pruned_loss=0.07729, over 21617.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.3042, pruned_loss=0.07101, over 4262661.90 frames. ], batch size: 441, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 14:21:25,801 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.770e+02 3.589e+02 4.714e+02 6.474e+02 1.333e+03, threshold=9.428e+02, percent-clipped=3.0 2023-06-25 14:22:32,719 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1352292.0, ans=0.125 2023-06-25 14:22:43,552 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1352352.0, ans=0.2 2023-06-25 14:23:10,825 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.47 vs. limit=15.0 2023-06-25 14:23:16,549 INFO [train.py:996] (0/4) Epoch 8, batch 11950, loss[loss=0.2291, simple_loss=0.3499, pruned_loss=0.05408, over 21194.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.3049, pruned_loss=0.06827, over 4261670.64 frames. 
], batch size: 548, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 14:23:28,073 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1352472.0, ans=0.0 2023-06-25 14:23:46,697 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1352532.0, ans=0.125 2023-06-25 14:24:02,832 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1352592.0, ans=0.125 2023-06-25 14:24:04,982 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.31 vs. limit=12.0 2023-06-25 14:24:26,581 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1352652.0, ans=0.1 2023-06-25 14:24:34,802 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.63 vs. limit=10.0 2023-06-25 14:24:35,830 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1352652.0, ans=0.125 2023-06-25 14:24:37,862 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1352652.0, ans=0.125 2023-06-25 14:24:59,857 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1352712.0, ans=0.125 2023-06-25 14:25:06,501 INFO [train.py:996] (0/4) Epoch 8, batch 12000, loss[loss=0.2287, simple_loss=0.2903, pruned_loss=0.08355, over 21840.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2974, pruned_loss=0.06584, over 4252156.76 frames. ], batch size: 98, lr: 3.76e-03, grad_scale: 32.0 2023-06-25 14:25:06,502 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-25 14:25:31,291 INFO [train.py:1028] (0/4) Epoch 8, validation: loss=0.2626, simple_loss=0.3537, pruned_loss=0.08577, over 1796401.00 frames. 
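The entry just above (train.py:1019 / train.py:1028) records the periodic validation pass that is interleaved with training: the dev set is swept once, a frame-weighted average loss is reported ("over 1796401.00 frames"), and training then resumes. A minimal sketch of how such a validation-loss computation can be structured is shown below; the names `model`, `valid_dl`, and `compute_loss` are illustrative assumptions for this sketch, not the actual icefall API.

```python
# Illustrative sketch only: a periodic validation-loss pass in the spirit of
# the log entries above. `model`, `valid_dl`, and `compute_loss` are
# hypothetical stand-ins, not icefall's actual implementation.
import torch


def compute_validation_loss(model, valid_dl, compute_loss, device="cuda:0"):
    """Run one full pass over the dev dataloader and return the average loss."""
    was_training = model.training
    model.eval()  # disable dropout and other train-time behaviour
    tot_loss, tot_frames = 0.0, 0.0
    with torch.no_grad():  # no gradients are needed for validation
        for batch in valid_dl:
            loss, num_frames = compute_loss(model, batch, device=device)
            # Accumulate frame-weighted loss so the average corresponds to
            # the "loss=..., over N frames" style reported in the log.
            tot_loss += loss.item() * num_frames
            tot_frames += num_frames
    if was_training:
        model.train()  # restore training mode before resuming the next batch
    return tot_loss / max(tot_frames, 1.0)
```

In the log, the resulting averages (loss, simple_loss, pruned_loss) are printed together with the peak CUDA memory allocated so far before training continues at the next batch.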
2023-06-25 14:25:31,293 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23714MB 2023-06-25 14:25:34,822 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.287e+02 3.581e+02 4.444e+02 6.606e+02 1.302e+03, threshold=8.887e+02, percent-clipped=8.0 2023-06-25 14:26:14,721 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1352892.0, ans=0.125 2023-06-25 14:26:16,824 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1352892.0, ans=0.1 2023-06-25 14:26:21,817 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1352892.0, ans=0.125 2023-06-25 14:26:37,757 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1352952.0, ans=0.125 2023-06-25 14:26:46,500 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1352952.0, ans=0.125 2023-06-25 14:27:01,776 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1353012.0, ans=0.125 2023-06-25 14:27:08,712 INFO [train.py:996] (0/4) Epoch 8, batch 12050, loss[loss=0.2104, simple_loss=0.2714, pruned_loss=0.0747, over 21399.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2938, pruned_loss=0.067, over 4259925.62 frames. ], batch size: 177, lr: 3.76e-03, grad_scale: 32.0 2023-06-25 14:27:09,402 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1353072.0, ans=0.07 2023-06-25 14:27:43,293 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1353132.0, ans=0.2 2023-06-25 14:27:54,592 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.46 vs. limit=10.0 2023-06-25 14:28:41,527 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1353312.0, ans=0.2 2023-06-25 14:28:43,429 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1353312.0, ans=0.125 2023-06-25 14:29:10,837 INFO [train.py:996] (0/4) Epoch 8, batch 12100, loss[loss=0.2681, simple_loss=0.3864, pruned_loss=0.0749, over 19748.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.3024, pruned_loss=0.07065, over 4260110.38 frames. ], batch size: 702, lr: 3.76e-03, grad_scale: 32.0 2023-06-25 14:29:14,308 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.733e+02 4.401e+02 6.036e+02 8.453e+02 2.254e+03, threshold=1.207e+03, percent-clipped=23.0 2023-06-25 14:29:21,315 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.34 vs. 
limit=6.0 2023-06-25 14:30:07,783 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1353492.0, ans=0.125 2023-06-25 14:30:42,275 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 14:31:09,076 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1353672.0, ans=0.125 2023-06-25 14:31:09,971 INFO [train.py:996] (0/4) Epoch 8, batch 12150, loss[loss=0.248, simple_loss=0.3472, pruned_loss=0.07435, over 21699.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.3072, pruned_loss=0.07101, over 4262347.98 frames. ], batch size: 414, lr: 3.75e-03, grad_scale: 32.0 2023-06-25 14:31:43,606 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1353732.0, ans=0.0 2023-06-25 14:31:47,270 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1353732.0, ans=0.125 2023-06-25 14:32:06,093 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1353792.0, ans=0.0 2023-06-25 14:32:59,830 INFO [train.py:996] (0/4) Epoch 8, batch 12200, loss[loss=0.2119, simple_loss=0.2752, pruned_loss=0.07426, over 21803.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.3027, pruned_loss=0.07092, over 4256019.76 frames. ], batch size: 352, lr: 3.75e-03, grad_scale: 32.0 2023-06-25 14:33:03,414 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.861e+02 3.926e+02 5.745e+02 7.853e+02 1.417e+03, threshold=1.149e+03, percent-clipped=2.0 2023-06-25 14:33:25,496 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1354032.0, ans=0.125 2023-06-25 14:34:37,531 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1354212.0, ans=0.125 2023-06-25 14:34:47,638 INFO [train.py:996] (0/4) Epoch 8, batch 12250, loss[loss=0.2247, simple_loss=0.3054, pruned_loss=0.07205, over 20742.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2935, pruned_loss=0.06785, over 4249176.64 frames. ], batch size: 611, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 14:35:14,705 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1354332.0, ans=0.0 2023-06-25 14:35:34,332 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1354392.0, ans=0.1 2023-06-25 14:35:41,090 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1354392.0, ans=0.0 2023-06-25 14:36:14,345 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1354512.0, ans=0.125 2023-06-25 14:36:36,591 INFO [train.py:996] (0/4) Epoch 8, batch 12300, loss[loss=0.2024, simple_loss=0.2891, pruned_loss=0.05789, over 19924.00 frames. ], tot_loss[loss=0.2062, simple_loss=0.286, pruned_loss=0.06318, over 4246926.37 frames. 
], batch size: 704, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 14:36:41,781 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.141e+02 3.509e+02 4.835e+02 7.096e+02 1.534e+03, threshold=9.669e+02, percent-clipped=2.0 2023-06-25 14:36:56,959 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1354572.0, ans=0.125 2023-06-25 14:36:58,767 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1354632.0, ans=0.125 2023-06-25 14:37:16,806 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1354692.0, ans=0.0 2023-06-25 14:38:25,401 INFO [train.py:996] (0/4) Epoch 8, batch 12350, loss[loss=0.2066, simple_loss=0.3332, pruned_loss=0.04, over 19880.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.2894, pruned_loss=0.06364, over 4247086.58 frames. ], batch size: 702, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 14:38:31,277 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1354872.0, ans=0.125 2023-06-25 14:38:50,174 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1354932.0, ans=0.125 2023-06-25 14:39:12,379 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1354992.0, ans=0.0 2023-06-25 14:40:12,712 INFO [train.py:996] (0/4) Epoch 8, batch 12400, loss[loss=0.2206, simple_loss=0.29, pruned_loss=0.07559, over 21883.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.292, pruned_loss=0.06678, over 4257131.44 frames. ], batch size: 351, lr: 3.75e-03, grad_scale: 32.0 2023-06-25 14:40:17,817 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.711e+02 4.388e+02 6.020e+02 7.604e+02 1.312e+03, threshold=1.204e+03, percent-clipped=10.0 2023-06-25 14:40:29,101 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1355172.0, ans=0.125 2023-06-25 14:40:31,062 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1355172.0, ans=0.035 2023-06-25 14:40:35,944 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1355232.0, ans=0.125 2023-06-25 14:40:37,846 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1355232.0, ans=0.2 2023-06-25 14:42:04,065 INFO [train.py:996] (0/4) Epoch 8, batch 12450, loss[loss=0.1694, simple_loss=0.2081, pruned_loss=0.06536, over 20067.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2947, pruned_loss=0.06942, over 4262394.19 frames. 
], batch size: 703, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 14:42:29,093 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1355532.0, ans=0.125 2023-06-25 14:42:36,082 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1355532.0, ans=0.1 2023-06-25 14:42:41,537 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1355592.0, ans=0.2 2023-06-25 14:43:26,505 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1355652.0, ans=0.125 2023-06-25 14:43:55,688 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.99 vs. limit=10.0 2023-06-25 14:43:55,930 INFO [train.py:996] (0/4) Epoch 8, batch 12500, loss[loss=0.253, simple_loss=0.3487, pruned_loss=0.07865, over 21615.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.3039, pruned_loss=0.07187, over 4265577.10 frames. ], batch size: 230, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 14:44:02,979 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.309e+02 4.292e+02 5.906e+02 9.269e+02 3.047e+03, threshold=1.181e+03, percent-clipped=14.0 2023-06-25 14:45:09,485 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1355952.0, ans=0.125 2023-06-25 14:45:16,886 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1355952.0, ans=0.125 2023-06-25 14:45:42,702 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1356012.0, ans=0.125 2023-06-25 14:45:47,022 INFO [train.py:996] (0/4) Epoch 8, batch 12550, loss[loss=0.2291, simple_loss=0.3158, pruned_loss=0.07122, over 21837.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3103, pruned_loss=0.07548, over 4270794.09 frames. ], batch size: 124, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 14:45:47,633 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1356072.0, ans=0.2 2023-06-25 14:46:38,119 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1356192.0, ans=0.0 2023-06-25 14:47:11,964 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1356252.0, ans=0.2 2023-06-25 14:47:42,237 INFO [train.py:996] (0/4) Epoch 8, batch 12600, loss[loss=0.1948, simple_loss=0.2855, pruned_loss=0.05209, over 21616.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3091, pruned_loss=0.07284, over 4272491.36 frames. ], batch size: 230, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 14:47:48,467 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.38 vs. 
limit=15.0 2023-06-25 14:47:48,696 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.799e+02 4.195e+02 5.786e+02 8.769e+02 1.751e+03, threshold=1.157e+03, percent-clipped=8.0 2023-06-25 14:48:13,661 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1356432.0, ans=0.0 2023-06-25 14:49:03,495 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1356552.0, ans=0.0 2023-06-25 14:49:19,826 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.49 vs. limit=15.0 2023-06-25 14:49:23,550 INFO [train.py:996] (0/4) Epoch 8, batch 12650, loss[loss=0.2148, simple_loss=0.2832, pruned_loss=0.07323, over 21933.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.304, pruned_loss=0.06946, over 4273937.09 frames. ], batch size: 316, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 14:50:44,878 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1356852.0, ans=0.95 2023-06-25 14:50:56,372 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.67 vs. limit=15.0 2023-06-25 14:51:15,159 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1356912.0, ans=0.0 2023-06-25 14:51:19,762 INFO [train.py:996] (0/4) Epoch 8, batch 12700, loss[loss=0.2143, simple_loss=0.2879, pruned_loss=0.07032, over 21844.00 frames. ], tot_loss[loss=0.224, simple_loss=0.3036, pruned_loss=0.07219, over 4276599.70 frames. ], batch size: 247, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 14:51:32,485 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.627e+02 4.265e+02 5.595e+02 7.381e+02 1.572e+03, threshold=1.119e+03, percent-clipped=3.0 2023-06-25 14:51:40,423 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.72 vs. limit=6.0 2023-06-25 14:51:55,783 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1357032.0, ans=0.2 2023-06-25 14:52:11,072 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1357092.0, ans=0.0 2023-06-25 14:52:30,079 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 14:53:02,657 INFO [train.py:996] (0/4) Epoch 8, batch 12750, loss[loss=0.2091, simple_loss=0.291, pruned_loss=0.06362, over 21694.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.3042, pruned_loss=0.07178, over 4277023.31 frames. 
], batch size: 263, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 14:53:26,080 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1357332.0, ans=0.125 2023-06-25 14:53:55,894 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1357392.0, ans=0.2 2023-06-25 14:54:06,371 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1357452.0, ans=0.2 2023-06-25 14:54:10,024 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1357452.0, ans=0.125 2023-06-25 14:54:15,499 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1357452.0, ans=0.04949747468305833 2023-06-25 14:54:19,475 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.73 vs. limit=15.0 2023-06-25 14:54:57,129 INFO [train.py:996] (0/4) Epoch 8, batch 12800, loss[loss=0.2331, simple_loss=0.3026, pruned_loss=0.08182, over 21756.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.303, pruned_loss=0.0719, over 4283561.42 frames. ], batch size: 112, lr: 3.75e-03, grad_scale: 32.0 2023-06-25 14:55:02,999 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1357572.0, ans=0.0 2023-06-25 14:55:04,030 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.881e+02 3.698e+02 4.519e+02 5.409e+02 8.581e+02, threshold=9.039e+02, percent-clipped=0.0 2023-06-25 14:56:04,446 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.87 vs. limit=15.0 2023-06-25 14:56:14,614 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1357752.0, ans=0.125 2023-06-25 14:56:20,039 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1357812.0, ans=0.125 2023-06-25 14:56:47,892 INFO [train.py:996] (0/4) Epoch 8, batch 12850, loss[loss=0.1973, simple_loss=0.2982, pruned_loss=0.04826, over 21887.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3043, pruned_loss=0.07315, over 4285414.66 frames. ], batch size: 316, lr: 3.75e-03, grad_scale: 32.0 2023-06-25 14:58:06,530 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.92 vs. limit=6.0 2023-06-25 14:58:40,136 INFO [train.py:996] (0/4) Epoch 8, batch 12900, loss[loss=0.2031, simple_loss=0.2858, pruned_loss=0.06023, over 21781.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.3028, pruned_loss=0.0701, over 4287929.99 frames. 
], batch size: 282, lr: 3.75e-03, grad_scale: 32.0 2023-06-25 14:58:46,434 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1358172.0, ans=0.0 2023-06-25 14:58:47,393 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.654e+02 3.588e+02 4.373e+02 7.155e+02 1.857e+03, threshold=8.745e+02, percent-clipped=14.0 2023-06-25 14:59:15,781 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1358292.0, ans=0.2 2023-06-25 14:59:26,316 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.44 vs. limit=15.0 2023-06-25 14:59:40,203 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.10 vs. limit=6.0 2023-06-25 14:59:49,915 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1358352.0, ans=0.125 2023-06-25 15:00:02,975 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.85 vs. limit=15.0 2023-06-25 15:00:02,976 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=1358412.0, ans=15.0 2023-06-25 15:00:24,729 INFO [train.py:996] (0/4) Epoch 8, batch 12950, loss[loss=0.2764, simple_loss=0.408, pruned_loss=0.07246, over 19736.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.3049, pruned_loss=0.06868, over 4280017.84 frames. ], batch size: 702, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 15:01:22,851 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1358592.0, ans=0.025 2023-06-25 15:01:26,698 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1358592.0, ans=0.125 2023-06-25 15:01:33,926 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1358652.0, ans=0.125 2023-06-25 15:02:07,354 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.56 vs. limit=15.0 2023-06-25 15:02:14,922 INFO [train.py:996] (0/4) Epoch 8, batch 13000, loss[loss=0.2058, simple_loss=0.2846, pruned_loss=0.06353, over 21369.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.3068, pruned_loss=0.07001, over 4273239.76 frames. 
], batch size: 211, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 15:02:23,110 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.358e+02 3.843e+02 4.886e+02 6.754e+02 1.173e+03, threshold=9.772e+02, percent-clipped=9.0 2023-06-25 15:02:34,094 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1358832.0, ans=0.125 2023-06-25 15:03:09,144 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1358892.0, ans=0.125 2023-06-25 15:03:12,348 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1358892.0, ans=0.125 2023-06-25 15:03:54,687 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1359012.0, ans=0.125 2023-06-25 15:03:57,497 INFO [train.py:996] (0/4) Epoch 8, batch 13050, loss[loss=0.1932, simple_loss=0.2701, pruned_loss=0.05815, over 21284.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2998, pruned_loss=0.06727, over 4269691.57 frames. ], batch size: 159, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 15:05:03,533 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1359252.0, ans=0.125 2023-06-25 15:05:29,645 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.70 vs. limit=12.0 2023-06-25 15:05:41,267 INFO [train.py:996] (0/4) Epoch 8, batch 13100, loss[loss=0.2045, simple_loss=0.2759, pruned_loss=0.06657, over 21002.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2984, pruned_loss=0.06675, over 4278034.83 frames. ], batch size: 608, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 15:05:43,743 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1359372.0, ans=0.0 2023-06-25 15:05:50,178 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.856e+02 3.427e+02 4.465e+02 6.179e+02 1.477e+03, threshold=8.931e+02, percent-clipped=2.0 2023-06-25 15:05:58,490 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.56 vs. limit=15.0 2023-06-25 15:06:32,765 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1359492.0, ans=0.125 2023-06-25 15:06:52,078 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.39 vs. limit=15.0 2023-06-25 15:07:01,869 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1359552.0, ans=0.2 2023-06-25 15:07:07,300 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1359552.0, ans=0.125 2023-06-25 15:07:31,676 INFO [train.py:996] (0/4) Epoch 8, batch 13150, loss[loss=0.2433, simple_loss=0.3146, pruned_loss=0.08602, over 21619.00 frames. ], tot_loss[loss=0.22, simple_loss=0.3007, pruned_loss=0.0696, over 4271826.42 frames. ], batch size: 441, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 15:07:39,970 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.58 vs. 
limit=22.5 2023-06-25 15:09:03,298 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1359912.0, ans=0.125 2023-06-25 15:09:07,685 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.91 vs. limit=15.0 2023-06-25 15:09:16,240 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.92 vs. limit=15.0 2023-06-25 15:09:27,385 INFO [train.py:996] (0/4) Epoch 8, batch 13200, loss[loss=0.1521, simple_loss=0.2014, pruned_loss=0.05136, over 17468.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2985, pruned_loss=0.06936, over 4266125.02 frames. ], batch size: 61, lr: 3.75e-03, grad_scale: 32.0 2023-06-25 15:09:39,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1359972.0, ans=0.125 2023-06-25 15:09:46,137 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.744e+02 3.706e+02 4.388e+02 6.661e+02 1.084e+03, threshold=8.775e+02, percent-clipped=9.0 2023-06-25 15:10:09,111 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1360032.0, ans=0.125 2023-06-25 15:10:59,440 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1360212.0, ans=0.125 2023-06-25 15:11:21,332 INFO [train.py:996] (0/4) Epoch 8, batch 13250, loss[loss=0.2258, simple_loss=0.2964, pruned_loss=0.07755, over 21881.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2977, pruned_loss=0.07062, over 4275871.38 frames. ], batch size: 316, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 15:11:53,292 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=16.18 vs. limit=22.5 2023-06-25 15:12:57,753 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1360512.0, ans=0.09899494936611666 2023-06-25 15:13:05,078 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.27 vs. limit=6.0 2023-06-25 15:13:18,499 INFO [train.py:996] (0/4) Epoch 8, batch 13300, loss[loss=0.1989, simple_loss=0.3141, pruned_loss=0.04186, over 19765.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.3012, pruned_loss=0.07028, over 4272478.28 frames. ], batch size: 702, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 15:13:34,394 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.430e+02 3.717e+02 5.105e+02 6.593e+02 1.654e+03, threshold=1.021e+03, percent-clipped=11.0 2023-06-25 15:13:35,692 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.84 vs. limit=12.0 2023-06-25 15:14:04,233 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1360692.0, ans=0.0 2023-06-25 15:15:00,486 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1360812.0, ans=0.025 2023-06-25 15:15:08,169 INFO [train.py:996] (0/4) Epoch 8, batch 13350, loss[loss=0.2505, simple_loss=0.339, pruned_loss=0.081, over 21616.00 frames. 
], tot_loss[loss=0.2259, simple_loss=0.3056, pruned_loss=0.07309, over 4276863.88 frames. ], batch size: 389, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:16:01,964 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1360992.0, ans=0.125 2023-06-25 15:17:03,315 INFO [train.py:996] (0/4) Epoch 8, batch 13400, loss[loss=0.2344, simple_loss=0.3089, pruned_loss=0.07996, over 21209.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3064, pruned_loss=0.07474, over 4277609.72 frames. ], batch size: 143, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:17:03,807 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=1361172.0, ans=0.5 2023-06-25 15:17:13,988 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.122e+02 3.939e+02 4.986e+02 7.057e+02 1.760e+03, threshold=9.973e+02, percent-clipped=5.0 2023-06-25 15:17:20,372 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1361232.0, ans=0.125 2023-06-25 15:18:20,034 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1361352.0, ans=0.125 2023-06-25 15:18:52,664 INFO [train.py:996] (0/4) Epoch 8, batch 13450, loss[loss=0.2446, simple_loss=0.3206, pruned_loss=0.0843, over 21719.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3075, pruned_loss=0.07693, over 4284903.58 frames. ], batch size: 124, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:19:26,846 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1361592.0, ans=0.0 2023-06-25 15:19:28,396 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1361592.0, ans=0.0 2023-06-25 15:19:57,779 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1361592.0, ans=0.2 2023-06-25 15:20:23,524 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1361712.0, ans=0.04949747468305833 2023-06-25 15:20:42,802 INFO [train.py:996] (0/4) Epoch 8, batch 13500, loss[loss=0.1748, simple_loss=0.2498, pruned_loss=0.04988, over 21748.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.2976, pruned_loss=0.07379, over 4281571.39 frames. 
], batch size: 282, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:20:53,701 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.704e+02 3.900e+02 4.940e+02 7.289e+02 1.559e+03, threshold=9.879e+02, percent-clipped=7.0 2023-06-25 15:20:59,770 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1361832.0, ans=0.1 2023-06-25 15:21:03,788 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1361832.0, ans=0.125 2023-06-25 15:21:09,182 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1361832.0, ans=0.125 2023-06-25 15:21:18,560 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1361832.0, ans=0.04949747468305833 2023-06-25 15:22:33,641 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff3.min_abs, batch_count=1362072.0, ans=0.2 2023-06-25 15:22:34,507 INFO [train.py:996] (0/4) Epoch 8, batch 13550, loss[loss=0.3405, simple_loss=0.4237, pruned_loss=0.1286, over 21455.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3025, pruned_loss=0.07311, over 4275693.85 frames. ], batch size: 507, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:22:44,419 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.53 vs. limit=15.0 2023-06-25 15:23:13,330 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1362132.0, ans=0.2 2023-06-25 15:23:13,459 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1362132.0, ans=0.125 2023-06-25 15:23:16,109 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.86 vs. limit=10.0 2023-06-25 15:24:18,238 INFO [train.py:996] (0/4) Epoch 8, batch 13600, loss[loss=0.2009, simple_loss=0.264, pruned_loss=0.06891, over 19985.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3037, pruned_loss=0.07376, over 4278842.46 frames. ], batch size: 703, lr: 3.74e-03, grad_scale: 32.0 2023-06-25 15:24:28,504 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.890e+02 3.859e+02 5.232e+02 7.287e+02 1.567e+03, threshold=1.046e+03, percent-clipped=12.0 2023-06-25 15:25:12,231 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.28 vs. limit=15.0 2023-06-25 15:25:22,728 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1362492.0, ans=0.0 2023-06-25 15:25:24,322 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1362492.0, ans=0.1 2023-06-25 15:26:01,146 INFO [train.py:996] (0/4) Epoch 8, batch 13650, loss[loss=0.1956, simple_loss=0.2627, pruned_loss=0.06429, over 21521.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.3, pruned_loss=0.07064, over 4281338.07 frames. ], batch size: 441, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:26:05,818 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.33 vs. 
limit=15.0 2023-06-25 15:26:14,439 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.26 vs. limit=15.0 2023-06-25 15:26:32,007 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.94 vs. limit=15.0 2023-06-25 15:26:58,694 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1362792.0, ans=0.0 2023-06-25 15:27:50,065 INFO [train.py:996] (0/4) Epoch 8, batch 13700, loss[loss=0.1545, simple_loss=0.2031, pruned_loss=0.05289, over 17377.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2956, pruned_loss=0.07053, over 4277447.80 frames. ], batch size: 66, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:28:08,795 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.733e+02 3.641e+02 4.705e+02 7.070e+02 1.116e+03, threshold=9.410e+02, percent-clipped=4.0 2023-06-25 15:28:38,838 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1363032.0, ans=0.2 2023-06-25 15:29:32,354 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.68 vs. limit=15.0 2023-06-25 15:29:46,985 INFO [train.py:996] (0/4) Epoch 8, batch 13750, loss[loss=0.2239, simple_loss=0.3054, pruned_loss=0.07119, over 21657.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2956, pruned_loss=0.07069, over 4279276.44 frames. ], batch size: 414, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:30:52,111 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.29 vs. limit=15.0 2023-06-25 15:31:06,154 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1363452.0, ans=0.125 2023-06-25 15:31:42,860 INFO [train.py:996] (0/4) Epoch 8, batch 13800, loss[loss=0.2242, simple_loss=0.3274, pruned_loss=0.06052, over 21664.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2995, pruned_loss=0.07037, over 4269158.32 frames. ], batch size: 247, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:32:00,800 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.906e+02 4.517e+02 6.756e+02 9.995e+02 2.111e+03, threshold=1.351e+03, percent-clipped=26.0 2023-06-25 15:32:12,839 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.17 vs. limit=15.0 2023-06-25 15:32:13,802 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1363632.0, ans=0.125 2023-06-25 15:32:21,013 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1363632.0, ans=0.0 2023-06-25 15:32:40,306 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1363752.0, ans=0.125 2023-06-25 15:33:17,846 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1363812.0, ans=0.1 2023-06-25 15:33:33,393 INFO [train.py:996] (0/4) Epoch 8, batch 13850, loss[loss=0.2259, simple_loss=0.3437, pruned_loss=0.05408, over 20748.00 frames. 
], tot_loss[loss=0.2226, simple_loss=0.3042, pruned_loss=0.07052, over 4263281.70 frames. ], batch size: 607, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:33:43,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1363872.0, ans=0.0 2023-06-25 15:34:14,796 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 15:35:20,937 INFO [train.py:996] (0/4) Epoch 8, batch 13900, loss[loss=0.222, simple_loss=0.2892, pruned_loss=0.0774, over 21825.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.3075, pruned_loss=0.07313, over 4267521.36 frames. ], batch size: 247, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:35:33,171 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.067e+02 4.054e+02 4.959e+02 6.399e+02 1.364e+03, threshold=9.918e+02, percent-clipped=1.0 2023-06-25 15:35:42,493 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1364232.0, ans=0.125 2023-06-25 15:36:24,448 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1364352.0, ans=0.125 2023-06-25 15:36:31,089 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1364352.0, ans=0.0 2023-06-25 15:37:09,434 INFO [train.py:996] (0/4) Epoch 8, batch 13950, loss[loss=0.2228, simple_loss=0.2968, pruned_loss=0.07441, over 21800.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3069, pruned_loss=0.0748, over 4281264.86 frames. ], batch size: 298, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:37:28,025 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.28 vs. limit=12.0 2023-06-25 15:37:28,115 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.09 vs. limit=22.5 2023-06-25 15:38:03,955 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1364592.0, ans=0.125 2023-06-25 15:38:57,952 INFO [train.py:996] (0/4) Epoch 8, batch 14000, loss[loss=0.1673, simple_loss=0.2431, pruned_loss=0.04575, over 21210.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.3038, pruned_loss=0.07273, over 4281668.93 frames. ], batch size: 143, lr: 3.74e-03, grad_scale: 32.0 2023-06-25 15:39:09,930 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.689e+02 3.751e+02 4.894e+02 7.186e+02 1.368e+03, threshold=9.787e+02, percent-clipped=13.0 2023-06-25 15:39:12,536 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 15:39:16,006 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1364832.0, ans=0.125 2023-06-25 15:39:24,682 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1364832.0, ans=0.5 2023-06-25 15:40:00,860 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_na.min_abs, batch_count=1364952.0, ans=0.02 2023-06-25 15:40:19,477 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.90 vs. 
limit=15.0 2023-06-25 15:40:20,560 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1364952.0, ans=0.0 2023-06-25 15:40:28,821 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1365012.0, ans=0.125 2023-06-25 15:40:45,633 INFO [train.py:996] (0/4) Epoch 8, batch 14050, loss[loss=0.1999, simple_loss=0.2595, pruned_loss=0.07022, over 21546.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2971, pruned_loss=0.06902, over 4271363.11 frames. ], batch size: 132, lr: 3.74e-03, grad_scale: 32.0 2023-06-25 15:40:47,934 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1365072.0, ans=0.0 2023-06-25 15:41:12,480 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1365132.0, ans=0.125 2023-06-25 15:41:31,965 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1365192.0, ans=0.2 2023-06-25 15:42:12,632 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1365312.0, ans=0.125 2023-06-25 15:42:33,639 INFO [train.py:996] (0/4) Epoch 8, batch 14100, loss[loss=0.2432, simple_loss=0.316, pruned_loss=0.08519, over 21931.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2919, pruned_loss=0.06853, over 4261818.49 frames. ], batch size: 372, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:42:41,454 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1365372.0, ans=0.2 2023-06-25 15:42:47,589 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.656e+02 3.476e+02 4.443e+02 5.620e+02 1.211e+03, threshold=8.886e+02, percent-clipped=2.0 2023-06-25 15:42:50,012 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1365432.0, ans=0.0 2023-06-25 15:42:51,701 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1365432.0, ans=0.07 2023-06-25 15:43:02,746 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.23 vs. limit=15.0 2023-06-25 15:43:32,890 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1365492.0, ans=0.1 2023-06-25 15:44:19,927 INFO [train.py:996] (0/4) Epoch 8, batch 14150, loss[loss=0.2444, simple_loss=0.3729, pruned_loss=0.05801, over 20802.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.296, pruned_loss=0.06931, over 4256239.71 frames. ], batch size: 607, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:44:25,595 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1365672.0, ans=0.125 2023-06-25 15:45:22,279 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1365852.0, ans=0.2 2023-06-25 15:45:36,846 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.32 vs. limit=22.5 2023-06-25 15:46:01,155 INFO [train.py:996] (0/4) Epoch 8, batch 14200, loss[loss=0.2243, simple_loss=0.2856, pruned_loss=0.08153, over 21685.00 frames. 
], tot_loss[loss=0.2148, simple_loss=0.2946, pruned_loss=0.0675, over 4260800.83 frames. ], batch size: 332, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:46:20,192 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.681e+02 4.879e+02 7.691e+02 1.070e+03 2.190e+03, threshold=1.538e+03, percent-clipped=38.0 2023-06-25 15:46:24,370 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1366032.0, ans=0.0 2023-06-25 15:46:34,853 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1366032.0, ans=0.1 2023-06-25 15:47:09,071 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1366152.0, ans=0.125 2023-06-25 15:47:31,045 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.89 vs. limit=22.5 2023-06-25 15:47:33,773 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 15:47:49,401 INFO [train.py:996] (0/4) Epoch 8, batch 14250, loss[loss=0.2078, simple_loss=0.2779, pruned_loss=0.06887, over 21766.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2908, pruned_loss=0.06794, over 4252036.97 frames. ], batch size: 112, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:48:00,694 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1366272.0, ans=0.0 2023-06-25 15:48:13,044 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1366332.0, ans=0.2 2023-06-25 15:48:16,352 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1366332.0, ans=0.0 2023-06-25 15:48:38,161 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1366392.0, ans=0.125 2023-06-25 15:49:17,558 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1366452.0, ans=0.1 2023-06-25 15:49:39,579 INFO [train.py:996] (0/4) Epoch 8, batch 14300, loss[loss=0.2786, simple_loss=0.3849, pruned_loss=0.08621, over 21240.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2926, pruned_loss=0.0677, over 4252025.80 frames. ], batch size: 549, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:49:59,651 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.539e+02 3.382e+02 4.720e+02 7.552e+02 1.673e+03, threshold=9.439e+02, percent-clipped=2.0 2023-06-25 15:51:23,209 INFO [train.py:996] (0/4) Epoch 8, batch 14350, loss[loss=0.2284, simple_loss=0.3076, pruned_loss=0.07465, over 21731.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.3, pruned_loss=0.06963, over 4253915.25 frames. ], batch size: 389, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:51:38,376 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.62 vs. 
limit=22.5 2023-06-25 15:52:01,226 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1366932.0, ans=0.125 2023-06-25 15:52:03,034 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1366992.0, ans=0.125 2023-06-25 15:52:34,606 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1367052.0, ans=0.0 2023-06-25 15:52:48,418 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1367052.0, ans=0.125 2023-06-25 15:52:51,674 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1367052.0, ans=0.1 2023-06-25 15:53:16,579 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1367172.0, ans=0.2 2023-06-25 15:53:17,526 INFO [train.py:996] (0/4) Epoch 8, batch 14400, loss[loss=0.2141, simple_loss=0.2754, pruned_loss=0.07641, over 21571.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2972, pruned_loss=0.06968, over 4259006.40 frames. ], batch size: 441, lr: 3.74e-03, grad_scale: 32.0 2023-06-25 15:53:30,787 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.862e+02 3.850e+02 4.891e+02 6.324e+02 1.594e+03, threshold=9.783e+02, percent-clipped=6.0 2023-06-25 15:54:37,392 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1367412.0, ans=0.125 2023-06-25 15:54:53,564 INFO [train.py:996] (0/4) Epoch 8, batch 14450, loss[loss=0.1988, simple_loss=0.2669, pruned_loss=0.06535, over 21800.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.292, pruned_loss=0.06958, over 4249519.43 frames. ], batch size: 351, lr: 3.74e-03, grad_scale: 32.0 2023-06-25 15:55:35,456 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1367592.0, ans=0.125 2023-06-25 15:56:08,274 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1367652.0, ans=0.0 2023-06-25 15:56:10,074 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 15:56:34,734 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1367712.0, ans=0.125 2023-06-25 15:56:40,151 INFO [train.py:996] (0/4) Epoch 8, batch 14500, loss[loss=0.2573, simple_loss=0.3287, pruned_loss=0.0929, over 21413.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.288, pruned_loss=0.06969, over 4246631.73 frames. 
], batch size: 471, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:56:59,531 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1367772.0, ans=0.125 2023-06-25 15:57:02,149 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.716e+02 3.452e+02 4.183e+02 6.174e+02 1.088e+03, threshold=8.366e+02, percent-clipped=1.0 2023-06-25 15:57:04,363 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1367832.0, ans=0.125 2023-06-25 15:57:46,525 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1367892.0, ans=0.0 2023-06-25 15:57:48,035 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1367892.0, ans=0.0 2023-06-25 15:58:04,074 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-228000.pt 2023-06-25 15:58:33,920 INFO [train.py:996] (0/4) Epoch 8, batch 14550, loss[loss=0.2579, simple_loss=0.3317, pruned_loss=0.09205, over 21268.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2914, pruned_loss=0.07051, over 4249890.22 frames. ], batch size: 176, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 15:58:57,101 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1368132.0, ans=0.125 2023-06-25 15:58:57,185 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1368132.0, ans=0.0 2023-06-25 15:59:52,817 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1368252.0, ans=0.025 2023-06-25 16:00:22,848 INFO [train.py:996] (0/4) Epoch 8, batch 14600, loss[loss=0.2137, simple_loss=0.3102, pruned_loss=0.05865, over 21394.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.2986, pruned_loss=0.07352, over 4262576.62 frames. ], batch size: 211, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:00:33,405 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1368372.0, ans=0.0 2023-06-25 16:00:38,180 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.261e+02 4.717e+02 6.049e+02 8.556e+02 1.756e+03, threshold=1.210e+03, percent-clipped=27.0 2023-06-25 16:00:52,713 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 16:02:10,909 INFO [train.py:996] (0/4) Epoch 8, batch 14650, loss[loss=0.2373, simple_loss=0.3231, pruned_loss=0.07577, over 21658.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.3011, pruned_loss=0.07252, over 4259789.83 frames. ], batch size: 389, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:02:30,510 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1368732.0, ans=0.125 2023-06-25 16:02:41,386 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1368732.0, ans=0.0 2023-06-25 16:03:20,653 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1368852.0, ans=0.125 2023-06-25 16:03:58,465 INFO [train.py:996] (0/4) Epoch 8, batch 14700, loss[loss=0.2952, simple_loss=0.3804, pruned_loss=0.105, over 21517.00 frames. 
], tot_loss[loss=0.2162, simple_loss=0.2959, pruned_loss=0.06822, over 4250739.76 frames. ], batch size: 508, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:04:14,436 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.383e+02 3.677e+02 4.958e+02 7.109e+02 1.155e+03, threshold=9.917e+02, percent-clipped=0.0 2023-06-25 16:05:45,567 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1369212.0, ans=0.0 2023-06-25 16:05:50,291 INFO [train.py:996] (0/4) Epoch 8, batch 14750, loss[loss=0.2305, simple_loss=0.3104, pruned_loss=0.07528, over 21623.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.3004, pruned_loss=0.07106, over 4251325.06 frames. ], batch size: 230, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:06:35,241 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1369332.0, ans=0.125 2023-06-25 16:06:44,187 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1369392.0, ans=0.1 2023-06-25 16:06:51,122 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1369392.0, ans=0.2 2023-06-25 16:07:09,716 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1369452.0, ans=0.125 2023-06-25 16:07:47,184 INFO [train.py:996] (0/4) Epoch 8, batch 14800, loss[loss=0.2114, simple_loss=0.28, pruned_loss=0.07137, over 21629.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3106, pruned_loss=0.07545, over 4250290.96 frames. ], batch size: 298, lr: 3.73e-03, grad_scale: 32.0 2023-06-25 16:08:01,466 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1369572.0, ans=0.0 2023-06-25 16:08:02,884 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1369572.0, ans=10.0 2023-06-25 16:08:12,802 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.189e+02 4.811e+02 6.847e+02 1.023e+03 2.171e+03, threshold=1.369e+03, percent-clipped=26.0 2023-06-25 16:08:24,053 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1369632.0, ans=0.125 2023-06-25 16:08:49,734 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1369692.0, ans=0.125 2023-06-25 16:08:51,908 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.81 vs. limit=22.5 2023-06-25 16:09:19,334 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.86 vs. limit=15.0 2023-06-25 16:09:43,094 INFO [train.py:996] (0/4) Epoch 8, batch 14850, loss[loss=0.1968, simple_loss=0.2652, pruned_loss=0.06415, over 21422.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3057, pruned_loss=0.07548, over 4255865.56 frames. ], batch size: 211, lr: 3.73e-03, grad_scale: 32.0 2023-06-25 16:10:00,399 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.35 vs. 
limit=15.0 2023-06-25 16:11:15,674 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1370112.0, ans=0.2 2023-06-25 16:11:39,845 INFO [train.py:996] (0/4) Epoch 8, batch 14900, loss[loss=0.3355, simple_loss=0.4029, pruned_loss=0.134, over 21429.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3083, pruned_loss=0.07786, over 4259162.62 frames. ], batch size: 507, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:11:45,956 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1370172.0, ans=0.0 2023-06-25 16:11:57,472 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.090e+02 4.206e+02 5.469e+02 8.347e+02 1.577e+03, threshold=1.094e+03, percent-clipped=2.0 2023-06-25 16:12:58,858 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1370352.0, ans=0.07 2023-06-25 16:12:58,880 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1370352.0, ans=0.0 2023-06-25 16:13:26,337 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1370412.0, ans=0.125 2023-06-25 16:13:30,837 INFO [train.py:996] (0/4) Epoch 8, batch 14950, loss[loss=0.2311, simple_loss=0.3196, pruned_loss=0.07131, over 21263.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.309, pruned_loss=0.07777, over 4262635.15 frames. ], batch size: 549, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:14:02,651 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.97 vs. limit=6.0 2023-06-25 16:14:30,290 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1370592.0, ans=0.125 2023-06-25 16:14:41,047 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1370652.0, ans=0.0 2023-06-25 16:14:41,126 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1370652.0, ans=0.2 2023-06-25 16:15:08,569 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1370712.0, ans=0.0 2023-06-25 16:15:19,934 INFO [train.py:996] (0/4) Epoch 8, batch 15000, loss[loss=0.2384, simple_loss=0.3072, pruned_loss=0.08482, over 21822.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3103, pruned_loss=0.0785, over 4265190.06 frames. ], batch size: 351, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:15:19,936 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-25 16:15:40,719 INFO [train.py:1028] (0/4) Epoch 8, validation: loss=0.2554, simple_loss=0.3473, pruned_loss=0.08173, over 1796401.00 frames. 2023-06-25 16:15:40,720 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23714MB 2023-06-25 16:15:58,827 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.944e+02 3.850e+02 4.977e+02 6.696e+02 1.113e+03, threshold=9.953e+02, percent-clipped=2.0 2023-06-25 16:16:00,255 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.49 vs. limit=5.0 2023-06-25 16:16:19,840 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.96 vs. 
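At batch 15000 the trainer pauses to run the dev set ("Computing validation loss"), prints a frame-weighted validation summary, and reports the peak GPU memory seen so far. A minimal sketch of such a pass is below; the `(loss, num_frames)` return of `model(batch)` and the loss bookkeeping are assumptions, while the peak-memory query is the standard `torch.cuda.max_memory_allocated` call.

```python
import torch

def run_validation(model, valid_loader, device="cuda:0"):
    # Sketch of the periodic validation pass logged as
    # "Computing validation loss" ... "validation: loss=..., over N frames."
    # The (loss, num_frames) interface of model(batch) is assumed.
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    with torch.no_grad():
        for batch in valid_loader:
            loss, num_frames = model(batch)
            tot_loss += loss.item() * num_frames
            tot_frames += num_frames
    model.train()
    max_mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
    print(f"validation: loss={tot_loss / tot_frames:.4f}, over {tot_frames:.2f} frames.")
    print(f"Maximum memory allocated so far is {max_mb}MB")
```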
limit=15.0 2023-06-25 16:17:11,379 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1371012.0, ans=0.0 2023-06-25 16:17:13,130 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1371012.0, ans=0.0 2023-06-25 16:17:30,927 INFO [train.py:996] (0/4) Epoch 8, batch 15050, loss[loss=0.2256, simple_loss=0.3242, pruned_loss=0.06354, over 20764.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3108, pruned_loss=0.0786, over 4260084.15 frames. ], batch size: 608, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:18:06,261 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1371132.0, ans=0.125 2023-06-25 16:18:15,210 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1371192.0, ans=0.125 2023-06-25 16:18:17,232 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.84 vs. limit=15.0 2023-06-25 16:18:45,383 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1371252.0, ans=0.1 2023-06-25 16:19:20,748 INFO [train.py:996] (0/4) Epoch 8, batch 15100, loss[loss=0.2368, simple_loss=0.3116, pruned_loss=0.081, over 21827.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.315, pruned_loss=0.07822, over 4267760.29 frames. ], batch size: 282, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:19:29,268 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.19 vs. limit=15.0 2023-06-25 16:19:43,658 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.981e+02 4.480e+02 6.447e+02 8.808e+02 1.442e+03, threshold=1.289e+03, percent-clipped=16.0 2023-06-25 16:20:15,960 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1371492.0, ans=0.125 2023-06-25 16:21:09,586 INFO [train.py:996] (0/4) Epoch 8, batch 15150, loss[loss=0.1916, simple_loss=0.2533, pruned_loss=0.065, over 21564.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.311, pruned_loss=0.07802, over 4254171.27 frames. ], batch size: 231, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:21:10,390 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1371672.0, ans=0.125 2023-06-25 16:21:52,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1371732.0, ans=0.125 2023-06-25 16:22:46,166 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1371912.0, ans=0.0 2023-06-25 16:22:57,838 INFO [train.py:996] (0/4) Epoch 8, batch 15200, loss[loss=0.1663, simple_loss=0.2561, pruned_loss=0.03826, over 21593.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.3021, pruned_loss=0.07343, over 4258913.98 frames. ], batch size: 263, lr: 3.73e-03, grad_scale: 32.0 2023-06-25 16:23:01,084 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.63 vs. 
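The [scaling.py:962] lines compare a per-module "whitening" metric against a limit (for example metric=9.96 vs. limit=15.0 just above). One common way to quantify how far activations are from being white is the spread of the eigenvalues of their covariance; the function below computes such a ratio purely as an illustration. The actual metric, grouping, and limits used by scaling.py may be defined differently.

```python
import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
    # Illustrative whiteness measure for activations x of shape (N, C):
    # mean(eig^2) / mean(eig)^2 of the per-group feature covariance.  The value
    # is 1.0 when all eigenvalues are equal ("white") and grows as the spectrum
    # becomes more uneven.  A stand-in, not necessarily the logged metric.
    n, c = x.shape
    xg = x.reshape(n, num_groups, c // num_groups).transpose(0, 1)  # (G, N, C/G)
    xg = xg - xg.mean(dim=1, keepdim=True)
    cov = xg.transpose(1, 2) @ xg / n                               # (G, C/G, C/G)
    eigs = torch.linalg.eigvalsh(cov)
    return ((eigs ** 2).mean() / eigs.mean() ** 2).item()

# Near-white features score close to 1 (finite samples push it a bit higher);
# strongly correlated features score much larger.
print(whitening_metric(torch.randn(4096, 256)))
```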
limit=15.0 2023-06-25 16:23:25,366 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1372032.0, ans=0.1 2023-06-25 16:23:26,361 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.586e+02 3.888e+02 5.742e+02 8.749e+02 1.820e+03, threshold=1.148e+03, percent-clipped=6.0 2023-06-25 16:23:45,371 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.72 vs. limit=15.0 2023-06-25 16:23:46,171 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 16:24:00,179 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1372092.0, ans=0.2 2023-06-25 16:24:52,738 INFO [train.py:996] (0/4) Epoch 8, batch 15250, loss[loss=0.253, simple_loss=0.3482, pruned_loss=0.07888, over 19646.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2965, pruned_loss=0.07207, over 4247411.19 frames. ], batch size: 703, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:25:35,151 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1372332.0, ans=0.2 2023-06-25 16:26:06,069 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.86 vs. limit=15.0 2023-06-25 16:26:07,425 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1372452.0, ans=0.0 2023-06-25 16:26:36,349 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1372512.0, ans=0.1 2023-06-25 16:26:36,395 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1372512.0, ans=0.95 2023-06-25 16:26:38,017 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1372512.0, ans=0.2 2023-06-25 16:26:48,285 INFO [train.py:996] (0/4) Epoch 8, batch 15300, loss[loss=0.2582, simple_loss=0.3525, pruned_loss=0.08199, over 17834.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.2999, pruned_loss=0.07477, over 4255269.83 frames. ], batch size: 60, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:26:48,802 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1372572.0, ans=0.125 2023-06-25 16:27:11,675 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1372632.0, ans=0.125 2023-06-25 16:27:12,649 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.814e+02 3.962e+02 5.141e+02 6.603e+02 1.300e+03, threshold=1.028e+03, percent-clipped=5.0 2023-06-25 16:27:57,459 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff3.min_abs, batch_count=1372752.0, ans=0.2 2023-06-25 16:28:02,654 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1372752.0, ans=0.0 2023-06-25 16:28:30,831 INFO [train.py:996] (0/4) Epoch 8, batch 15350, loss[loss=0.2051, simple_loss=0.3126, pruned_loss=0.04878, over 21810.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3053, pruned_loss=0.07696, over 4253451.36 frames. 
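The grad_scale field in the train.py summaries moves back and forth between 16.0 and 32.0 (32.0 at batch 15200, 16.0 again at batch 15250), which is the signature of dynamic loss scaling in mixed-precision training: the scale backs off when an overflow is detected and grows again after a run of successful steps. A minimal loop using PyTorch's standard `torch.cuda.amp` utilities is sketched below; the growth and backoff settings and the model/loss interface are illustrative, not this run's exact configuration.

```python
import torch
from torch.cuda.amp import GradScaler, autocast

def fp16_train_steps(model, optimizer, batches):
    # Dynamic loss scaling: scaler.step() skips the parameter update on
    # overflow and halves the scale; enough successful steps double it again,
    # producing grad_scale values such as 16.0 / 32.0 in the summaries.
    scaler = GradScaler(init_scale=32.0, growth_factor=2.0, backoff_factor=0.5)
    for batch in batches:
        optimizer.zero_grad(set_to_none=True)
        with autocast():
            loss = model(batch)            # assumed to return a scalar loss
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
        print("grad_scale:", scaler.get_scale())
```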
], batch size: 282, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:29:07,837 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1372932.0, ans=0.0 2023-06-25 16:29:11,315 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1372932.0, ans=0.0 2023-06-25 16:30:06,952 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1373112.0, ans=0.2 2023-06-25 16:30:13,115 INFO [train.py:996] (0/4) Epoch 8, batch 15400, loss[loss=0.2218, simple_loss=0.3098, pruned_loss=0.06683, over 21932.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.308, pruned_loss=0.0757, over 4259937.23 frames. ], batch size: 118, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:30:46,928 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.892e+02 4.124e+02 5.602e+02 8.412e+02 1.592e+03, threshold=1.120e+03, percent-clipped=11.0 2023-06-25 16:30:54,829 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1373232.0, ans=0.125 2023-06-25 16:31:34,963 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 16:31:45,514 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1373412.0, ans=0.125 2023-06-25 16:31:54,587 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1373412.0, ans=0.125 2023-06-25 16:32:01,324 INFO [train.py:996] (0/4) Epoch 8, batch 15450, loss[loss=0.2262, simple_loss=0.3088, pruned_loss=0.07177, over 20703.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.3048, pruned_loss=0.07455, over 4262235.35 frames. ], batch size: 607, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:32:35,361 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1373532.0, ans=0.125 2023-06-25 16:32:44,664 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1373532.0, ans=0.0 2023-06-25 16:33:13,479 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.34 vs. limit=6.0 2023-06-25 16:34:02,563 INFO [train.py:996] (0/4) Epoch 8, batch 15500, loss[loss=0.1941, simple_loss=0.2844, pruned_loss=0.05189, over 15701.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.3055, pruned_loss=0.07438, over 4259256.48 frames. ], batch size: 60, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:34:12,165 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1373772.0, ans=0.0 2023-06-25 16:34:26,259 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1373832.0, ans=0.125 2023-06-25 16:34:27,244 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.729e+02 3.957e+02 5.678e+02 7.705e+02 1.506e+03, threshold=1.136e+03, percent-clipped=3.0 2023-06-25 16:34:39,521 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.83 vs. 
limit=15.0 2023-06-25 16:34:59,288 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1373892.0, ans=0.1 2023-06-25 16:35:53,647 INFO [train.py:996] (0/4) Epoch 8, batch 15550, loss[loss=0.2086, simple_loss=0.2932, pruned_loss=0.06204, over 21781.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.3026, pruned_loss=0.07251, over 4249389.70 frames. ], batch size: 371, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:36:29,278 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1374132.0, ans=0.0 2023-06-25 16:37:32,697 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1374312.0, ans=0.1 2023-06-25 16:37:34,784 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.54 vs. limit=15.0 2023-06-25 16:37:42,355 INFO [train.py:996] (0/4) Epoch 8, batch 15600, loss[loss=0.2098, simple_loss=0.2871, pruned_loss=0.06626, over 21767.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.296, pruned_loss=0.07092, over 4244927.34 frames. ], batch size: 371, lr: 3.73e-03, grad_scale: 32.0 2023-06-25 16:38:01,282 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.645e+02 3.371e+02 3.943e+02 5.908e+02 1.274e+03, threshold=7.887e+02, percent-clipped=2.0 2023-06-25 16:38:35,851 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1374492.0, ans=0.09899494936611666 2023-06-25 16:38:36,559 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.72 vs. limit=15.0 2023-06-25 16:38:47,884 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1374552.0, ans=0.125 2023-06-25 16:39:30,811 INFO [train.py:996] (0/4) Epoch 8, batch 15650, loss[loss=0.1761, simple_loss=0.2447, pruned_loss=0.05375, over 21382.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2947, pruned_loss=0.06989, over 4249514.11 frames. ], batch size: 211, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:39:40,484 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1374672.0, ans=0.0 2023-06-25 16:41:01,234 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1374912.0, ans=0.0 2023-06-25 16:41:19,320 INFO [train.py:996] (0/4) Epoch 8, batch 15700, loss[loss=0.2, simple_loss=0.2721, pruned_loss=0.06397, over 22048.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2914, pruned_loss=0.06933, over 4247129.51 frames. ], batch size: 103, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:41:22,001 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1374972.0, ans=0.1 2023-06-25 16:41:36,363 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.54 vs. 
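Each summary reports both the current batch's loss and a tot_loss[... over N frames] aggregate whose frame count sits around 4.25 million in this stretch. A frame-weighted accumulator reproduces that kind of field; how the real tracker decays or resets its window is not visible in the log, so the class below is only a plausible sketch with hypothetical per-batch values.

```python
class FrameWeightedLoss:
    """Frame-weighted running aggregate in the spirit of the
    'tot_loss[loss=..., over N frames.]' fields; the real tracker's
    decay/reset behaviour is not reproduced here."""

    def __init__(self):
        self.loss_sum = 0.0
        self.frames = 0.0

    def update(self, loss: float, num_frames: float) -> None:
        self.loss_sum += loss * num_frames
        self.frames += num_frames

    def __str__(self) -> str:
        avg = self.loss_sum / max(self.frames, 1.0)
        return f"tot_loss[loss={avg:.4f}, over {self.frames:.2f} frames.]"

tracker = FrameWeightedLoss()
tracker.update(0.2245, 21486.0)   # hypothetical per-batch loss / frame counts
tracker.update(0.2187, 19772.0)
print(tracker)
```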
limit=22.5 2023-06-25 16:41:40,401 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.691e+02 3.513e+02 4.156e+02 5.605e+02 1.068e+03, threshold=8.312e+02, percent-clipped=8.0 2023-06-25 16:41:59,394 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1375092.0, ans=0.2 2023-06-25 16:42:34,245 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.21 vs. limit=15.0 2023-06-25 16:42:35,338 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 16:42:46,019 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1375212.0, ans=0.0 2023-06-25 16:43:06,894 INFO [train.py:996] (0/4) Epoch 8, batch 15750, loss[loss=0.1791, simple_loss=0.2442, pruned_loss=0.057, over 21263.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2871, pruned_loss=0.06904, over 4254565.27 frames. ], batch size: 211, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:43:11,319 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1375272.0, ans=0.0 2023-06-25 16:44:55,728 INFO [train.py:996] (0/4) Epoch 8, batch 15800, loss[loss=0.1943, simple_loss=0.2608, pruned_loss=0.06391, over 21772.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2849, pruned_loss=0.06885, over 4253477.17 frames. ], batch size: 371, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 16:44:58,029 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1375572.0, ans=0.125 2023-06-25 16:45:16,669 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.881e+02 4.132e+02 5.788e+02 8.606e+02 2.042e+03, threshold=1.158e+03, percent-clipped=26.0 2023-06-25 16:45:26,592 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1375632.0, ans=0.1 2023-06-25 16:45:28,713 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1375632.0, ans=0.125 2023-06-25 16:45:33,814 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1375632.0, ans=0.0 2023-06-25 16:46:44,104 INFO [train.py:996] (0/4) Epoch 8, batch 15850, loss[loss=0.2168, simple_loss=0.2916, pruned_loss=0.07101, over 21714.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2869, pruned_loss=0.07084, over 4260702.48 frames. ], batch size: 332, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 16:46:57,386 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.84 vs. limit=15.0 2023-06-25 16:47:13,481 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.70 vs. limit=10.0 2023-06-25 16:47:19,673 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1375932.0, ans=0.0 2023-06-25 16:47:27,437 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.82 vs. 
limit=15.0 2023-06-25 16:47:30,207 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1375992.0, ans=0.125 2023-06-25 16:48:31,111 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1376172.0, ans=0.2 2023-06-25 16:48:32,045 INFO [train.py:996] (0/4) Epoch 8, batch 15900, loss[loss=0.1961, simple_loss=0.2593, pruned_loss=0.06648, over 21616.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2867, pruned_loss=0.07076, over 4263504.12 frames. ], batch size: 263, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 16:48:42,941 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 16:48:44,620 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1376172.0, ans=0.125 2023-06-25 16:48:52,531 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.061e+02 4.407e+02 5.744e+02 8.356e+02 1.559e+03, threshold=1.149e+03, percent-clipped=5.0 2023-06-25 16:49:05,618 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1376232.0, ans=0.125 2023-06-25 16:49:09,687 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.29 vs. limit=6.0 2023-06-25 16:49:14,750 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.67 vs. limit=10.0 2023-06-25 16:49:33,085 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1376352.0, ans=0.125 2023-06-25 16:50:19,045 INFO [train.py:996] (0/4) Epoch 8, batch 15950, loss[loss=0.1737, simple_loss=0.2577, pruned_loss=0.04489, over 21336.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2875, pruned_loss=0.06928, over 4253774.81 frames. ], batch size: 194, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 16:51:15,747 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1376652.0, ans=0.125 2023-06-25 16:51:40,348 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1376712.0, ans=0.125 2023-06-25 16:51:49,730 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.59 vs. limit=15.0 2023-06-25 16:52:10,855 INFO [train.py:996] (0/4) Epoch 8, batch 16000, loss[loss=0.1868, simple_loss=0.2662, pruned_loss=0.05366, over 21180.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2878, pruned_loss=0.06699, over 4251854.40 frames. 
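The learning rate printed with each summary decays very slowly, from 3.73e-03 down to 3.72e-03 over this stretch, which points to a schedule that varies smoothly with the global batch count and the epoch rather than one with step-wise drops. The function below uses one such smooth form; the exponents and every constant in it are assumptions chosen only to show the shape of the decay, not the schedule actually used for this run.

```python
def smooth_lr(base_lr: float, batch: int, epoch: float,
              lr_batches: float = 5000.0, lr_epochs: float = 3.5) -> float:
    # One possible smooth decay in both batch count and epoch: each factor is
    # ~1 early in training and falls off like x**-0.25 later, so the printed
    # rate changes only in its last digit from one summary to the next.
    batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * batch_factor * epoch_factor

for b in (200_000, 200_060, 200_120):   # successive logging intervals (assumed)
    print(b, round(smooth_lr(base_lr=0.05, batch=b, epoch=8.0), 6))
```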
], batch size: 159, lr: 3.72e-03, grad_scale: 32.0 2023-06-25 16:52:31,819 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.505e+02 3.799e+02 4.877e+02 8.252e+02 1.708e+03, threshold=9.755e+02, percent-clipped=5.0 2023-06-25 16:52:35,933 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1376832.0, ans=0.125 2023-06-25 16:52:54,086 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1376892.0, ans=0.0 2023-06-25 16:53:25,381 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1377012.0, ans=0.0 2023-06-25 16:53:31,772 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1377012.0, ans=0.2 2023-06-25 16:53:59,494 INFO [train.py:996] (0/4) Epoch 8, batch 16050, loss[loss=0.2, simple_loss=0.2981, pruned_loss=0.05091, over 20749.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2904, pruned_loss=0.06597, over 4259194.64 frames. ], batch size: 607, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 16:54:02,004 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1377072.0, ans=0.125 2023-06-25 16:54:05,305 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1377072.0, ans=0.07 2023-06-25 16:54:20,577 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1377132.0, ans=0.0 2023-06-25 16:54:55,748 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1377252.0, ans=0.125 2023-06-25 16:55:07,829 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1377252.0, ans=0.125 2023-06-25 16:55:08,013 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1377252.0, ans=0.125 2023-06-25 16:55:47,386 INFO [train.py:996] (0/4) Epoch 8, batch 16100, loss[loss=0.2318, simple_loss=0.3346, pruned_loss=0.06452, over 21736.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.294, pruned_loss=0.06691, over 4265713.22 frames. ], batch size: 298, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 16:56:10,217 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.825e+02 4.315e+02 5.631e+02 9.006e+02 2.276e+03, threshold=1.126e+03, percent-clipped=22.0 2023-06-25 16:56:13,162 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.04 vs. limit=10.0 2023-06-25 16:56:14,223 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1377432.0, ans=0.125 2023-06-25 16:56:53,051 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1377552.0, ans=0.125 2023-06-25 16:57:23,528 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1377612.0, ans=0.125 2023-06-25 16:57:35,031 INFO [train.py:996] (0/4) Epoch 8, batch 16150, loss[loss=0.2225, simple_loss=0.2894, pruned_loss=0.07779, over 21754.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2922, pruned_loss=0.06869, over 4282503.26 frames. 
], batch size: 389, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 16:57:35,677 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1377672.0, ans=0.125 2023-06-25 16:57:49,584 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1377672.0, ans=0.125 2023-06-25 16:57:49,634 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1377672.0, ans=0.125 2023-06-25 16:58:07,665 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.45 vs. limit=15.0 2023-06-25 16:59:13,648 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1377912.0, ans=0.0 2023-06-25 16:59:23,929 INFO [train.py:996] (0/4) Epoch 8, batch 16200, loss[loss=0.2363, simple_loss=0.3176, pruned_loss=0.07756, over 21405.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2957, pruned_loss=0.0703, over 4286523.74 frames. ], batch size: 211, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 16:59:34,487 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1377972.0, ans=0.125 2023-06-25 16:59:46,144 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.109e+02 4.018e+02 5.082e+02 7.447e+02 1.479e+03, threshold=1.016e+03, percent-clipped=6.0 2023-06-25 16:59:57,385 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1378092.0, ans=0.0 2023-06-25 17:00:06,349 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1378092.0, ans=0.035 2023-06-25 17:01:11,823 INFO [train.py:996] (0/4) Epoch 8, batch 16250, loss[loss=0.229, simple_loss=0.3144, pruned_loss=0.07177, over 21742.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.2982, pruned_loss=0.07157, over 4278522.31 frames. ], batch size: 298, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 17:01:24,787 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1378272.0, ans=0.125 2023-06-25 17:01:37,443 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1378332.0, ans=0.125 2023-06-25 17:01:55,687 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.51 vs. limit=15.0 2023-06-25 17:01:57,222 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.45 vs. limit=10.0 2023-06-25 17:02:05,392 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1378392.0, ans=0.125 2023-06-25 17:02:59,372 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1378572.0, ans=0.125 2023-06-25 17:03:00,433 INFO [train.py:996] (0/4) Epoch 8, batch 16300, loss[loss=0.1787, simple_loss=0.2729, pruned_loss=0.04225, over 21734.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2919, pruned_loss=0.06771, over 4262500.39 frames. 
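Every loss summary carries three numbers: loss, simple_loss, and pruned_loss. In pruned-transducer training the reported total is typically a weighted combination of the simple (linear-joiner) loss and the pruned loss, and a weight of 0.5 on the simple loss reproduces the triples in this log closely (for example 0.5 * 0.2957 + 0.0703 ≈ 0.2182 for the batch-16200 aggregate above). The helper below encodes that relationship; the 0.5 is inferred from the printed numbers, not confirmed from the training configuration.

```python
def combine_transducer_losses(simple_loss: float, pruned_loss: float,
                              simple_loss_scale: float = 0.5) -> float:
    # loss = simple_loss_scale * simple_loss + pruned_loss.
    # The 0.5 weight is inferred from the logged triples in this section
    # and is treated here as an assumption about the run.
    return simple_loss_scale * simple_loss + pruned_loss

print(combine_transducer_losses(0.2957, 0.0703))   # ~0.2182, matching tot_loss
```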
], batch size: 332, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 17:03:24,191 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.317e+02 3.309e+02 4.494e+02 6.869e+02 1.781e+03, threshold=8.988e+02, percent-clipped=11.0 2023-06-25 17:03:43,632 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.20 vs. limit=6.0 2023-06-25 17:03:59,172 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.24 vs. limit=12.0 2023-06-25 17:04:10,913 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1378752.0, ans=0.125 2023-06-25 17:04:29,065 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.04 vs. limit=15.0 2023-06-25 17:04:30,917 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.45 vs. limit=15.0 2023-06-25 17:04:49,718 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1378872.0, ans=0.2 2023-06-25 17:04:50,593 INFO [train.py:996] (0/4) Epoch 8, batch 16350, loss[loss=0.2229, simple_loss=0.301, pruned_loss=0.07238, over 21996.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2927, pruned_loss=0.06806, over 4261212.34 frames. ], batch size: 317, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 17:05:51,331 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1378992.0, ans=0.125 2023-06-25 17:05:51,368 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1378992.0, ans=0.0 2023-06-25 17:05:54,946 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.28 vs. limit=15.0 2023-06-25 17:06:30,896 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 17:06:30,939 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1379112.0, ans=0.0 2023-06-25 17:06:39,452 INFO [train.py:996] (0/4) Epoch 8, batch 16400, loss[loss=0.2006, simple_loss=0.2823, pruned_loss=0.05946, over 21826.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.2991, pruned_loss=0.07037, over 4262096.80 frames. ], batch size: 298, lr: 3.72e-03, grad_scale: 32.0 2023-06-25 17:06:52,685 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1379172.0, ans=0.1 2023-06-25 17:07:06,311 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1379232.0, ans=0.1 2023-06-25 17:07:09,124 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.039e+02 4.348e+02 5.266e+02 7.750e+02 2.110e+03, threshold=1.053e+03, percent-clipped=17.0 2023-06-25 17:07:33,970 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1379292.0, ans=0.2 2023-06-25 17:07:51,935 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.29 vs. 
limit=22.5 2023-06-25 17:08:00,473 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1379352.0, ans=0.025 2023-06-25 17:08:14,560 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1379412.0, ans=0.1 2023-06-25 17:08:22,737 INFO [train.py:996] (0/4) Epoch 8, batch 16450, loss[loss=0.1883, simple_loss=0.2673, pruned_loss=0.05464, over 21695.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2982, pruned_loss=0.0711, over 4272038.08 frames. ], batch size: 230, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 17:08:28,991 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1379472.0, ans=0.125 2023-06-25 17:09:00,963 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1379532.0, ans=0.125 2023-06-25 17:10:06,332 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1379712.0, ans=0.125 2023-06-25 17:10:11,872 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1379772.0, ans=0.1 2023-06-25 17:10:12,898 INFO [train.py:996] (0/4) Epoch 8, batch 16500, loss[loss=0.1697, simple_loss=0.2351, pruned_loss=0.05217, over 21327.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2956, pruned_loss=0.07077, over 4273126.02 frames. ], batch size: 176, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 17:10:43,404 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.095e+02 4.406e+02 5.923e+02 9.341e+02 2.012e+03, threshold=1.185e+03, percent-clipped=18.0 2023-06-25 17:12:03,247 INFO [train.py:996] (0/4) Epoch 8, batch 16550, loss[loss=0.2195, simple_loss=0.3027, pruned_loss=0.06822, over 21719.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2927, pruned_loss=0.06871, over 4264948.03 frames. ], batch size: 298, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 17:12:26,021 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=1380072.0, ans=0.025 2023-06-25 17:12:28,095 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1380072.0, ans=0.125 2023-06-25 17:12:39,831 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1380132.0, ans=0.125 2023-06-25 17:12:53,765 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1380132.0, ans=0.125 2023-06-25 17:13:08,262 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1380192.0, ans=0.0 2023-06-25 17:13:54,022 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1380312.0, ans=0.2 2023-06-25 17:14:02,551 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1380312.0, ans=0.0 2023-06-25 17:14:05,624 INFO [train.py:996] (0/4) Epoch 8, batch 16600, loss[loss=0.2871, simple_loss=0.3812, pruned_loss=0.09652, over 21643.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.3004, pruned_loss=0.07141, over 4269301.85 frames. 
], batch size: 389, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 17:14:06,308 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1380372.0, ans=0.1 2023-06-25 17:14:15,188 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1380372.0, ans=0.0 2023-06-25 17:14:40,093 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1380432.0, ans=0.0 2023-06-25 17:14:41,013 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.056e+02 4.921e+02 6.632e+02 9.394e+02 2.372e+03, threshold=1.326e+03, percent-clipped=11.0 2023-06-25 17:15:04,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1380492.0, ans=0.2 2023-06-25 17:15:20,767 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1380552.0, ans=0.125 2023-06-25 17:16:01,854 INFO [train.py:996] (0/4) Epoch 8, batch 16650, loss[loss=0.2017, simple_loss=0.2625, pruned_loss=0.07049, over 20071.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3081, pruned_loss=0.07385, over 4267652.36 frames. ], batch size: 703, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 17:16:29,318 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1380732.0, ans=0.125 2023-06-25 17:16:50,963 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.25 vs. limit=12.0 2023-06-25 17:17:28,692 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1380852.0, ans=0.2 2023-06-25 17:17:39,606 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1380912.0, ans=0.125 2023-06-25 17:17:58,064 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.32 vs. limit=10.0 2023-06-25 17:17:59,189 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1380972.0, ans=0.125 2023-06-25 17:18:00,243 INFO [train.py:996] (0/4) Epoch 8, batch 16700, loss[loss=0.1846, simple_loss=0.2506, pruned_loss=0.05935, over 21511.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3081, pruned_loss=0.07418, over 4268343.68 frames. ], batch size: 211, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 17:18:01,093 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1380972.0, ans=0.125 2023-06-25 17:18:01,224 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1380972.0, ans=0.1 2023-06-25 17:18:26,042 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.158e+02 5.069e+02 7.220e+02 1.088e+03 2.234e+03, threshold=1.444e+03, percent-clipped=12.0 2023-06-25 17:19:22,799 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1381152.0, ans=0.1 2023-06-25 17:19:54,801 INFO [train.py:996] (0/4) Epoch 8, batch 16750, loss[loss=0.2144, simple_loss=0.27, pruned_loss=0.07941, over 20027.00 frames. 
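The batch size field swings from a few dozen utterances up to about 700 (for instance 703 and 211 in the batch-16650/16700 summaries above). That is what duration-capped batching produces: many short cuts fit into one batch, only a few long ones do, while the total audio per batch stays roughly constant. The generic sketch below illustrates the policy; it is not the sampler used for this run, and the 600-second cap and cut durations are arbitrary.

```python
from typing import Iterable, Iterator, List, Tuple

Cut = Tuple[str, float]   # (cut id, duration in seconds) -- simplified stand-in

def duration_capped_batches(cuts: Iterable[Cut],
                            max_duration: float = 600.0) -> Iterator[List[Cut]]:
    # Pack cuts into batches whose summed duration stays under max_duration:
    # short cuts give a large "batch size", long cuts a small one.
    batch: List[Cut] = []
    total = 0.0
    for cut_id, dur in cuts:
        if batch and total + dur > max_duration:
            yield batch
            batch, total = [], 0.0
        batch.append((cut_id, dur))
        total += dur
    if batch:
        yield batch

short = [(f"s{i}", 1.5) for i in range(1000)]
long = [(f"l{i}", 14.0) for i in range(100)]
print([len(b) for b in duration_capped_batches(short)][:2])   # [400, 400]
print([len(b) for b in duration_capped_batches(long)][:2])    # [42, 42]
```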
], tot_loss[loss=0.2318, simple_loss=0.3107, pruned_loss=0.07643, over 4270362.17 frames. ], batch size: 703, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 17:21:16,713 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1381452.0, ans=0.5 2023-06-25 17:21:37,282 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1381512.0, ans=0.0 2023-06-25 17:21:39,447 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.29 vs. limit=15.0 2023-06-25 17:21:47,569 INFO [train.py:996] (0/4) Epoch 8, batch 16800, loss[loss=0.2095, simple_loss=0.2763, pruned_loss=0.07136, over 21615.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3153, pruned_loss=0.07631, over 4266932.17 frames. ], batch size: 212, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 17:22:18,704 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.357e+02 4.342e+02 5.532e+02 7.799e+02 1.934e+03, threshold=1.106e+03, percent-clipped=5.0 2023-06-25 17:22:38,952 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1381692.0, ans=0.1 2023-06-25 17:22:41,360 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.33 vs. limit=15.0 2023-06-25 17:22:46,074 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1381692.0, ans=0.125 2023-06-25 17:23:06,673 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1381812.0, ans=0.0 2023-06-25 17:23:13,125 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1381812.0, ans=0.0 2023-06-25 17:23:21,503 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1381812.0, ans=0.125 2023-06-25 17:23:24,347 INFO [train.py:996] (0/4) Epoch 8, batch 16850, loss[loss=0.2338, simple_loss=0.3007, pruned_loss=0.08347, over 21739.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3109, pruned_loss=0.07646, over 4275647.32 frames. ], batch size: 389, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 17:23:42,261 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1381872.0, ans=0.125 2023-06-25 17:23:43,979 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1381872.0, ans=0.1 2023-06-25 17:23:44,675 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.59 vs. 
limit=15.0 2023-06-25 17:24:24,267 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1381992.0, ans=0.125 2023-06-25 17:24:38,040 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1382052.0, ans=0.07 2023-06-25 17:24:48,157 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1382052.0, ans=0.1 2023-06-25 17:25:11,494 INFO [train.py:996] (0/4) Epoch 8, batch 16900, loss[loss=0.2187, simple_loss=0.2953, pruned_loss=0.07101, over 21675.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.307, pruned_loss=0.07502, over 4283856.14 frames. ], batch size: 414, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 17:25:46,752 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1382232.0, ans=0.05 2023-06-25 17:25:58,367 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.598e+02 4.085e+02 5.568e+02 7.476e+02 1.428e+03, threshold=1.114e+03, percent-clipped=3.0 2023-06-25 17:26:18,024 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1382292.0, ans=0.05 2023-06-25 17:26:21,547 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1382292.0, ans=0.125 2023-06-25 17:26:59,129 INFO [train.py:996] (0/4) Epoch 8, batch 16950, loss[loss=0.2451, simple_loss=0.3719, pruned_loss=0.0591, over 20722.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.3012, pruned_loss=0.07361, over 4281965.84 frames. ], batch size: 607, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 17:27:10,252 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1382472.0, ans=0.1 2023-06-25 17:27:35,191 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.28 vs. limit=15.0 2023-06-25 17:28:05,541 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1382592.0, ans=0.1 2023-06-25 17:28:12,822 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 17:28:18,377 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1382652.0, ans=0.0 2023-06-25 17:28:38,045 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1382712.0, ans=0.035 2023-06-25 17:28:53,783 INFO [train.py:996] (0/4) Epoch 8, batch 17000, loss[loss=0.2332, simple_loss=0.3016, pruned_loss=0.08238, over 21941.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.2991, pruned_loss=0.07367, over 4290561.89 frames. 
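The [scaling.py:1052] lines report a per-module auxiliary "loss-sum" attached to attention weights, 0.000e+00 in every occurrence here, i.e. currently contributing nothing. One generic way to attach such a penalty without changing a module's outputs is to accumulate it on the side during the forward pass; the wrapper below is a hypothetical, simplified version of that idea and not the mechanism scaling.py actually uses.

```python
import torch
import torch.nn as nn

class WithAuxLoss(nn.Module):
    """Hypothetical wrapper: return the wrapped tensor unchanged while
    accumulating an auxiliary penalty that can be logged as a loss-sum."""

    def __init__(self, name: str, weight: float = 0.0):
        super().__init__()
        self.name = name
        self.weight = weight
        self.loss_sum = torch.zeros(())

    def forward(self, attn_weights: torch.Tensor) -> torch.Tensor:
        if self.training and self.weight > 0.0:
            # Example penalty: discourage extremely peaked attention rows.
            peakiness = attn_weights.max(dim=-1).values.mean()
            self.loss_sum = self.loss_sum + self.weight * peakiness
        return attn_weights          # the activations themselves are untouched

aux = WithAuxLoss("encoder.encoders.0.layers.0.self_attn_weights", weight=0.0)
_ = aux(torch.softmax(torch.randn(2, 4, 8, 8), dim=-1))
print(f"WithLoss: name={aux.name}, loss-sum={aux.loss_sum.item():.3e}")
```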
], batch size: 333, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 17:29:35,994 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.768e+02 4.400e+02 6.237e+02 1.054e+03 1.925e+03, threshold=1.247e+03, percent-clipped=22.0 2023-06-25 17:29:59,500 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1382892.0, ans=0.125 2023-06-25 17:30:09,877 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1382952.0, ans=0.125 2023-06-25 17:30:14,120 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.69 vs. limit=8.0 2023-06-25 17:30:47,370 INFO [train.py:996] (0/4) Epoch 8, batch 17050, loss[loss=0.2243, simple_loss=0.3173, pruned_loss=0.06562, over 21636.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3052, pruned_loss=0.07568, over 4294072.85 frames. ], batch size: 263, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 17:32:29,765 INFO [train.py:996] (0/4) Epoch 8, batch 17100, loss[loss=0.2354, simple_loss=0.305, pruned_loss=0.08296, over 21918.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3037, pruned_loss=0.07588, over 4291577.47 frames. ], batch size: 414, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 17:32:36,380 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.92 vs. limit=22.5 2023-06-25 17:32:46,435 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.55 vs. limit=12.0 2023-06-25 17:33:06,776 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.094e+02 4.538e+02 6.730e+02 8.383e+02 1.322e+03, threshold=1.346e+03, percent-clipped=2.0 2023-06-25 17:33:14,637 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1383492.0, ans=0.1 2023-06-25 17:33:22,872 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1383492.0, ans=0.2 2023-06-25 17:34:23,055 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.76 vs. limit=5.0 2023-06-25 17:34:23,424 INFO [train.py:996] (0/4) Epoch 8, batch 17150, loss[loss=0.2121, simple_loss=0.2704, pruned_loss=0.07687, over 21244.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3002, pruned_loss=0.07553, over 4296358.40 frames. 
], batch size: 608, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 17:34:26,177 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1383672.0, ans=0.025 2023-06-25 17:34:29,414 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1383672.0, ans=0.0 2023-06-25 17:34:42,064 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1383672.0, ans=0.0 2023-06-25 17:34:45,849 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1383732.0, ans=0.0 2023-06-25 17:35:15,236 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1383792.0, ans=0.2 2023-06-25 17:35:16,952 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1383792.0, ans=0.0 2023-06-25 17:35:34,426 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.80 vs. limit=15.0 2023-06-25 17:35:35,204 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1383852.0, ans=0.125 2023-06-25 17:35:35,248 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1383852.0, ans=0.1 2023-06-25 17:35:36,977 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1383852.0, ans=0.125 2023-06-25 17:35:37,010 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1383852.0, ans=0.125 2023-06-25 17:36:18,566 INFO [train.py:996] (0/4) Epoch 8, batch 17200, loss[loss=0.234, simple_loss=0.3104, pruned_loss=0.07875, over 21345.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.2995, pruned_loss=0.07506, over 4294482.53 frames. ], batch size: 176, lr: 3.71e-03, grad_scale: 32.0 2023-06-25 17:36:44,519 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.991e+02 4.225e+02 5.384e+02 7.580e+02 1.533e+03, threshold=1.077e+03, percent-clipped=1.0 2023-06-25 17:36:48,734 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1384032.0, ans=0.04949747468305833 2023-06-25 17:37:19,748 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.38 vs. limit=15.0 2023-06-25 17:37:34,543 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1384152.0, ans=0.125 2023-06-25 17:38:01,741 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.72 vs. limit=12.0 2023-06-25 17:38:07,342 INFO [train.py:996] (0/4) Epoch 8, batch 17250, loss[loss=0.2625, simple_loss=0.3402, pruned_loss=0.0924, over 21424.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3037, pruned_loss=0.07662, over 4295349.69 frames. 
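Every record in this log follows the same layout: timestamp, level, [file:line], a (rank/world_size) tag, and the message. For reference, the standard `logging` module can reproduce that layout with a single format string; the rank and world size below are hard-coded for illustration, whereas a multi-GPU trainer would take them from its distributed environment.

```python
import logging

rank, world_size = 0, 4   # illustrative values; a DDP job would query its env

logging.basicConfig(
    level=logging.INFO,
    format=("%(asctime)s %(levelname)s [%(filename)s:%(lineno)d] "
            f"({rank}/{world_size}) %(message)s"),
)
logging.info("Epoch 8, batch 17200, ...")   # same layout as the records above
```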
], batch size: 471, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 17:39:07,019 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff2.min_abs, batch_count=1384392.0, ans=0.1 2023-06-25 17:39:46,923 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1384512.0, ans=0.1 2023-06-25 17:39:57,091 INFO [train.py:996] (0/4) Epoch 8, batch 17300, loss[loss=0.2358, simple_loss=0.3168, pruned_loss=0.07737, over 21811.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3111, pruned_loss=0.08029, over 4294517.61 frames. ], batch size: 282, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 17:39:57,711 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1384572.0, ans=0.125 2023-06-25 17:40:11,941 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1384572.0, ans=0.09899494936611666 2023-06-25 17:40:13,832 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1384632.0, ans=0.125 2023-06-25 17:40:25,190 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.394e+02 4.465e+02 6.350e+02 1.043e+03 2.141e+03, threshold=1.270e+03, percent-clipped=19.0 2023-06-25 17:40:35,112 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1384632.0, ans=0.0 2023-06-25 17:40:40,684 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1384692.0, ans=0.0 2023-06-25 17:41:14,270 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1384752.0, ans=0.125 2023-06-25 17:41:48,009 INFO [train.py:996] (0/4) Epoch 8, batch 17350, loss[loss=0.2323, simple_loss=0.3271, pruned_loss=0.06876, over 21723.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3112, pruned_loss=0.08033, over 4277458.79 frames. ], batch size: 441, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 17:41:48,715 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1384872.0, ans=0.2 2023-06-25 17:42:17,566 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1384932.0, ans=0.125 2023-06-25 17:42:36,348 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1384932.0, ans=0.125 2023-06-25 17:43:38,089 INFO [train.py:996] (0/4) Epoch 8, batch 17400, loss[loss=0.1939, simple_loss=0.2716, pruned_loss=0.05807, over 21690.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3075, pruned_loss=0.07692, over 4268572.21 frames. 
], batch size: 247, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 17:43:45,836 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1385172.0, ans=0.125 2023-06-25 17:43:58,064 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1385172.0, ans=0.125 2023-06-25 17:44:22,059 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.624e+02 3.735e+02 4.973e+02 6.706e+02 2.674e+03, threshold=9.946e+02, percent-clipped=3.0 2023-06-25 17:44:37,768 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.62 vs. limit=10.0 2023-06-25 17:45:32,747 INFO [train.py:996] (0/4) Epoch 8, batch 17450, loss[loss=0.1705, simple_loss=0.2612, pruned_loss=0.03986, over 21380.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3047, pruned_loss=0.07448, over 4273630.20 frames. ], batch size: 211, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 17:46:32,108 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1385592.0, ans=0.0 2023-06-25 17:46:37,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1385592.0, ans=0.1 2023-06-25 17:47:01,994 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.12 vs. limit=15.0 2023-06-25 17:47:02,036 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.29 vs. limit=12.0 2023-06-25 17:47:20,322 INFO [train.py:996] (0/4) Epoch 8, batch 17500, loss[loss=0.2137, simple_loss=0.2846, pruned_loss=0.07139, over 21872.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.3013, pruned_loss=0.07266, over 4277993.42 frames. ], batch size: 332, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 17:47:24,466 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1385772.0, ans=0.125 2023-06-25 17:47:57,721 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.917e+02 3.737e+02 5.034e+02 7.979e+02 1.418e+03, threshold=1.007e+03, percent-clipped=12.0 2023-06-25 17:48:26,146 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1385952.0, ans=0.0 2023-06-25 17:48:29,413 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1385952.0, ans=0.0 2023-06-25 17:49:07,141 INFO [train.py:996] (0/4) Epoch 8, batch 17550, loss[loss=0.2083, simple_loss=0.3036, pruned_loss=0.05649, over 21825.00 frames. ], tot_loss[loss=0.221, simple_loss=0.3007, pruned_loss=0.07063, over 4280987.55 frames. 
], batch size: 316, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 17:49:25,205 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1386072.0, ans=0.0 2023-06-25 17:49:26,668 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1386072.0, ans=0.0 2023-06-25 17:49:38,933 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1386132.0, ans=0.125 2023-06-25 17:50:50,298 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.43 vs. limit=6.0 2023-06-25 17:50:54,179 INFO [train.py:996] (0/4) Epoch 8, batch 17600, loss[loss=0.2385, simple_loss=0.3218, pruned_loss=0.07761, over 21500.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.3029, pruned_loss=0.07129, over 4271596.12 frames. ], batch size: 131, lr: 3.71e-03, grad_scale: 32.0 2023-06-25 17:51:03,873 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1386372.0, ans=0.2 2023-06-25 17:51:33,992 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.904e+02 3.918e+02 5.459e+02 7.837e+02 1.902e+03, threshold=1.092e+03, percent-clipped=12.0 2023-06-25 17:51:41,552 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1386492.0, ans=0.2 2023-06-25 17:51:47,693 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.83 vs. limit=22.5 2023-06-25 17:51:48,744 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1386492.0, ans=0.125 2023-06-25 17:52:02,553 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.42 vs. limit=12.0 2023-06-25 17:52:49,585 INFO [train.py:996] (0/4) Epoch 8, batch 17650, loss[loss=0.2422, simple_loss=0.3205, pruned_loss=0.08189, over 21252.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.3012, pruned_loss=0.07163, over 4266956.67 frames. ], batch size: 143, lr: 3.71e-03, grad_scale: 32.0 2023-06-25 17:52:54,330 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.73 vs. limit=12.0 2023-06-25 17:53:41,429 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1386792.0, ans=0.125 2023-06-25 17:54:39,352 INFO [train.py:996] (0/4) Epoch 8, batch 17700, loss[loss=0.2692, simple_loss=0.3505, pruned_loss=0.09395, over 21718.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2941, pruned_loss=0.06917, over 4249232.90 frames. ], batch size: 441, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 17:55:14,590 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.821e+02 4.539e+02 6.208e+02 9.459e+02 1.772e+03, threshold=1.242e+03, percent-clipped=17.0 2023-06-25 17:55:24,385 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1387092.0, ans=0.04949747468305833 2023-06-25 17:55:24,792 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.83 vs. 
limit=15.0 2023-06-25 17:55:34,437 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1387092.0, ans=0.125 2023-06-25 17:55:59,330 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1387152.0, ans=0.125 2023-06-25 17:56:34,264 INFO [train.py:996] (0/4) Epoch 8, batch 17750, loss[loss=0.2291, simple_loss=0.31, pruned_loss=0.07411, over 21536.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.3017, pruned_loss=0.07261, over 4257160.09 frames. ], batch size: 112, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 17:56:42,403 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1387272.0, ans=0.125 2023-06-25 17:56:56,675 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1387332.0, ans=0.125 2023-06-25 17:57:19,909 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1387392.0, ans=0.1 2023-06-25 17:57:55,949 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1387452.0, ans=0.2 2023-06-25 17:58:23,249 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.71 vs. limit=15.0 2023-06-25 17:58:25,872 INFO [train.py:996] (0/4) Epoch 8, batch 17800, loss[loss=0.2137, simple_loss=0.2729, pruned_loss=0.07725, over 20091.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.3004, pruned_loss=0.07129, over 4257817.02 frames. ], batch size: 702, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 17:58:50,216 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.85 vs. limit=15.0 2023-06-25 17:58:55,817 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.433e+02 4.160e+02 4.945e+02 7.686e+02 1.227e+03, threshold=9.890e+02, percent-clipped=0.0 2023-06-25 17:59:26,598 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1387692.0, ans=0.125 2023-06-25 18:00:10,387 INFO [train.py:996] (0/4) Epoch 8, batch 17850, loss[loss=0.2289, simple_loss=0.2881, pruned_loss=0.08486, over 20154.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.299, pruned_loss=0.0711, over 4253190.45 frames. ], batch size: 702, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 18:00:18,535 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.71 vs. limit=15.0 2023-06-25 18:01:25,277 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1388052.0, ans=0.2 2023-06-25 18:01:36,907 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.25 vs. limit=15.0 2023-06-25 18:01:54,651 INFO [train.py:996] (0/4) Epoch 8, batch 17900, loss[loss=0.2343, simple_loss=0.3342, pruned_loss=0.06721, over 21730.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3048, pruned_loss=0.07307, over 4253278.49 frames. 
], batch size: 351, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 18:01:57,453 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1388172.0, ans=0.1 2023-06-25 18:02:40,618 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.096e+02 4.760e+02 6.226e+02 9.356e+02 2.163e+03, threshold=1.245e+03, percent-clipped=21.0 2023-06-25 18:02:55,533 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1388292.0, ans=0.125 2023-06-25 18:03:08,291 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1388352.0, ans=0.125 2023-06-25 18:03:15,573 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1388352.0, ans=0.125 2023-06-25 18:03:41,693 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1388412.0, ans=0.2 2023-06-25 18:03:44,499 INFO [train.py:996] (0/4) Epoch 8, batch 17950, loss[loss=0.1811, simple_loss=0.2748, pruned_loss=0.04368, over 21761.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.3058, pruned_loss=0.0707, over 4255400.89 frames. ], batch size: 298, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 18:05:27,064 INFO [train.py:996] (0/4) Epoch 8, batch 18000, loss[loss=0.2135, simple_loss=0.2719, pruned_loss=0.0775, over 21320.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2998, pruned_loss=0.06855, over 4254939.25 frames. ], batch size: 160, lr: 3.71e-03, grad_scale: 32.0 2023-06-25 18:05:27,066 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-25 18:05:48,118 INFO [train.py:1028] (0/4) Epoch 8, validation: loss=0.2638, simple_loss=0.3571, pruned_loss=0.08527, over 1796401.00 frames. 2023-06-25 18:05:48,119 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23714MB 2023-06-25 18:05:48,750 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1388772.0, ans=0.125 2023-06-25 18:06:14,686 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1388832.0, ans=0.125 2023-06-25 18:06:23,292 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.397e+02 3.503e+02 4.294e+02 6.004e+02 1.457e+03, threshold=8.588e+02, percent-clipped=3.0 2023-06-25 18:06:32,515 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1388892.0, ans=0.125 2023-06-25 18:06:40,021 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.45 vs. limit=15.0 2023-06-25 18:06:46,691 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1388952.0, ans=0.125 2023-06-25 18:06:53,567 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1388952.0, ans=0.2 2023-06-25 18:07:37,511 INFO [train.py:996] (0/4) Epoch 8, batch 18050, loss[loss=0.2692, simple_loss=0.3317, pruned_loss=0.1034, over 21682.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2955, pruned_loss=0.06861, over 4244777.93 frames. 
], batch size: 441, lr: 3.71e-03, grad_scale: 32.0 2023-06-25 18:07:55,862 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1389072.0, ans=0.1 2023-06-25 18:08:24,607 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.36 vs. limit=8.0 2023-06-25 18:08:43,269 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1389252.0, ans=0.125 2023-06-25 18:09:32,765 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1389372.0, ans=0.125 2023-06-25 18:09:33,671 INFO [train.py:996] (0/4) Epoch 8, batch 18100, loss[loss=0.2135, simple_loss=0.3156, pruned_loss=0.05571, over 21636.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2968, pruned_loss=0.06966, over 4249694.73 frames. ], batch size: 263, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 18:10:05,555 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.803e+02 3.773e+02 4.901e+02 6.840e+02 2.108e+03, threshold=9.801e+02, percent-clipped=15.0 2023-06-25 18:10:23,736 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 18:11:03,141 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1389612.0, ans=0.05 2023-06-25 18:11:16,646 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1389612.0, ans=0.0 2023-06-25 18:11:22,667 INFO [train.py:996] (0/4) Epoch 8, batch 18150, loss[loss=0.1997, simple_loss=0.2872, pruned_loss=0.05607, over 21325.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2996, pruned_loss=0.07065, over 4250977.19 frames. ], batch size: 211, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 18:12:13,475 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.00 vs. limit=15.0 2023-06-25 18:13:09,166 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1389972.0, ans=0.0 2023-06-25 18:13:10,107 INFO [train.py:996] (0/4) Epoch 8, batch 18200, loss[loss=0.1995, simple_loss=0.2708, pruned_loss=0.06405, over 21427.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2946, pruned_loss=0.0697, over 4249743.56 frames. ], batch size: 144, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 18:13:30,876 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1390032.0, ans=0.125 2023-06-25 18:13:40,436 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.735e+02 4.047e+02 5.658e+02 8.715e+02 2.136e+03, threshold=1.132e+03, percent-clipped=16.0 2023-06-25 18:13:41,593 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.40 vs. limit=15.0 2023-06-25 18:14:18,495 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.56 vs. limit=15.0 2023-06-25 18:14:49,891 INFO [train.py:996] (0/4) Epoch 8, batch 18250, loss[loss=0.2471, simple_loss=0.3153, pruned_loss=0.08943, over 21858.00 frames. 
], tot_loss[loss=0.2119, simple_loss=0.2879, pruned_loss=0.06794, over 4244711.26 frames. ], batch size: 107, lr: 3.70e-03, grad_scale: 16.0 2023-06-25 18:15:14,590 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1390332.0, ans=0.125 2023-06-25 18:15:17,987 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1390332.0, ans=0.0 2023-06-25 18:15:31,552 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1390392.0, ans=0.0 2023-06-25 18:15:31,560 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1390392.0, ans=0.125 2023-06-25 18:15:34,949 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1390392.0, ans=0.0 2023-06-25 18:15:51,127 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=10.24 vs. limit=10.0 2023-06-25 18:15:55,598 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1390452.0, ans=0.0 2023-06-25 18:16:26,097 INFO [train.py:996] (0/4) Epoch 8, batch 18300, loss[loss=0.2691, simple_loss=0.3469, pruned_loss=0.09571, over 19901.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2886, pruned_loss=0.06742, over 4250122.69 frames. ], batch size: 702, lr: 3.70e-03, grad_scale: 16.0 2023-06-25 18:16:35,674 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1390572.0, ans=0.125 2023-06-25 18:17:11,327 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1390632.0, ans=0.1 2023-06-25 18:17:12,347 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.529e+02 4.046e+02 5.831e+02 1.006e+03 2.196e+03, threshold=1.166e+03, percent-clipped=19.0 2023-06-25 18:17:23,060 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1390692.0, ans=0.2 2023-06-25 18:17:43,804 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1390752.0, ans=0.0 2023-06-25 18:17:49,069 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 18:18:04,589 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1390812.0, ans=0.125 2023-06-25 18:18:12,736 INFO [train.py:996] (0/4) Epoch 8, batch 18350, loss[loss=0.1688, simple_loss=0.243, pruned_loss=0.04735, over 16973.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2911, pruned_loss=0.06697, over 4250833.08 frames. ], batch size: 65, lr: 3.70e-03, grad_scale: 16.0 2023-06-25 18:19:09,860 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1390992.0, ans=0.125 2023-06-25 18:19:20,023 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1391052.0, ans=0.125 2023-06-25 18:19:56,469 INFO [train.py:996] (0/4) Epoch 8, batch 18400, loss[loss=0.1812, simple_loss=0.2663, pruned_loss=0.04803, over 21629.00 frames. 
], tot_loss[loss=0.209, simple_loss=0.287, pruned_loss=0.06544, over 4247056.38 frames. ], batch size: 391, lr: 3.70e-03, grad_scale: 32.0 2023-06-25 18:20:01,956 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1391172.0, ans=0.125 2023-06-25 18:20:38,567 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.600e+02 3.733e+02 5.113e+02 7.460e+02 1.718e+03, threshold=1.023e+03, percent-clipped=6.0 2023-06-25 18:20:53,301 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1391292.0, ans=0.04949747468305833 2023-06-25 18:20:55,173 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 18:20:56,756 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=1391292.0, ans=10.0 2023-06-25 18:21:12,980 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.46 vs. limit=15.0 2023-06-25 18:21:45,630 INFO [train.py:996] (0/4) Epoch 8, batch 18450, loss[loss=0.2429, simple_loss=0.3527, pruned_loss=0.06655, over 19918.00 frames. ], tot_loss[loss=0.2054, simple_loss=0.2856, pruned_loss=0.06258, over 4247115.83 frames. ], batch size: 702, lr: 3.70e-03, grad_scale: 32.0 2023-06-25 18:22:06,346 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1391472.0, ans=0.125 2023-06-25 18:22:29,035 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.03 vs. limit=22.5 2023-06-25 18:22:33,578 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1391592.0, ans=0.2 2023-06-25 18:22:40,714 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1391592.0, ans=0.125 2023-06-25 18:23:26,118 INFO [train.py:996] (0/4) Epoch 8, batch 18500, loss[loss=0.1759, simple_loss=0.2631, pruned_loss=0.04434, over 21742.00 frames. ], tot_loss[loss=0.2021, simple_loss=0.2808, pruned_loss=0.06166, over 4248214.39 frames. ], batch size: 282, lr: 3.70e-03, grad_scale: 32.0 2023-06-25 18:23:34,008 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1391772.0, ans=0.125 2023-06-25 18:23:57,992 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.84 vs. 
limit=15.0 2023-06-25 18:24:07,378 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.590e+02 3.343e+02 4.214e+02 5.911e+02 1.246e+03, threshold=8.429e+02, percent-clipped=4.0 2023-06-25 18:24:10,063 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1391892.0, ans=0.125 2023-06-25 18:24:12,005 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=1391892.0, ans=15.0 2023-06-25 18:24:13,373 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1391892.0, ans=0.1 2023-06-25 18:24:25,079 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1391892.0, ans=0.0 2023-06-25 18:24:31,138 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.97 vs. limit=5.0 2023-06-25 18:24:45,497 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-232000.pt 2023-06-25 18:24:53,296 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.25 vs. limit=22.5 2023-06-25 18:25:08,898 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1392072.0, ans=0.125 2023-06-25 18:25:09,787 INFO [train.py:996] (0/4) Epoch 8, batch 18550, loss[loss=0.1995, simple_loss=0.2664, pruned_loss=0.06628, over 21815.00 frames. ], tot_loss[loss=0.1993, simple_loss=0.2768, pruned_loss=0.06091, over 4251287.04 frames. ], batch size: 107, lr: 3.70e-03, grad_scale: 32.0 2023-06-25 18:26:10,883 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1392192.0, ans=10.0 2023-06-25 18:27:04,683 INFO [train.py:996] (0/4) Epoch 8, batch 18600, loss[loss=0.1991, simple_loss=0.2818, pruned_loss=0.05815, over 21796.00 frames. ], tot_loss[loss=0.2001, simple_loss=0.2769, pruned_loss=0.06162, over 4244778.33 frames. ], batch size: 317, lr: 3.70e-03, grad_scale: 16.0 2023-06-25 18:27:36,467 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.831e+02 3.804e+02 5.092e+02 7.468e+02 1.783e+03, threshold=1.018e+03, percent-clipped=18.0 2023-06-25 18:27:47,071 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1392492.0, ans=0.125 2023-06-25 18:28:17,457 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1392612.0, ans=0.0 2023-06-25 18:28:19,856 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.20 vs. limit=15.0 2023-06-25 18:28:33,916 INFO [train.py:996] (0/4) Epoch 8, batch 18650, loss[loss=0.2108, simple_loss=0.2768, pruned_loss=0.07235, over 15336.00 frames. ], tot_loss[loss=0.2002, simple_loss=0.2764, pruned_loss=0.062, over 4241500.95 frames. 
], batch size: 60, lr: 3.70e-03, grad_scale: 16.0 2023-06-25 18:28:54,832 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1392672.0, ans=0.04949747468305833 2023-06-25 18:29:02,962 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1392732.0, ans=0.2 2023-06-25 18:29:20,452 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1392792.0, ans=0.0 2023-06-25 18:29:59,943 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1392912.0, ans=0.1 2023-06-25 18:30:16,030 INFO [train.py:996] (0/4) Epoch 8, batch 18700, loss[loss=0.2262, simple_loss=0.2923, pruned_loss=0.08008, over 21862.00 frames. ], tot_loss[loss=0.2016, simple_loss=0.2752, pruned_loss=0.064, over 4257634.95 frames. ], batch size: 414, lr: 3.70e-03, grad_scale: 16.0 2023-06-25 18:31:04,147 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.953e+02 3.708e+02 4.986e+02 6.996e+02 1.849e+03, threshold=9.973e+02, percent-clipped=6.0 2023-06-25 18:31:29,253 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1393152.0, ans=0.125 2023-06-25 18:32:03,269 INFO [train.py:996] (0/4) Epoch 8, batch 18750, loss[loss=0.2379, simple_loss=0.3156, pruned_loss=0.08012, over 21398.00 frames. ], tot_loss[loss=0.2042, simple_loss=0.2768, pruned_loss=0.0658, over 4259998.72 frames. ], batch size: 194, lr: 3.70e-03, grad_scale: 16.0 2023-06-25 18:32:21,347 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1393272.0, ans=0.025 2023-06-25 18:32:40,744 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1393332.0, ans=0.0 2023-06-25 18:33:04,077 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1393392.0, ans=0.04949747468305833 2023-06-25 18:33:48,490 INFO [train.py:996] (0/4) Epoch 8, batch 18800, loss[loss=0.2299, simple_loss=0.3061, pruned_loss=0.07686, over 21767.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2844, pruned_loss=0.06761, over 4252387.66 frames. ], batch size: 351, lr: 3.70e-03, grad_scale: 32.0 2023-06-25 18:34:31,451 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.68 vs. limit=15.0 2023-06-25 18:34:31,903 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.846e+02 4.247e+02 5.340e+02 7.897e+02 1.499e+03, threshold=1.068e+03, percent-clipped=10.0 2023-06-25 18:34:39,531 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1393692.0, ans=0.125 2023-06-25 18:34:54,842 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1393752.0, ans=0.1 2023-06-25 18:35:31,318 INFO [train.py:996] (0/4) Epoch 8, batch 18850, loss[loss=0.1915, simple_loss=0.2564, pruned_loss=0.06324, over 21258.00 frames. ], tot_loss[loss=0.2036, simple_loss=0.2801, pruned_loss=0.06352, over 4250439.72 frames. 
], batch size: 176, lr: 3.70e-03, grad_scale: 32.0 2023-06-25 18:36:17,244 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1393932.0, ans=0.015 2023-06-25 18:36:41,010 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 18:36:55,246 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.00 vs. limit=15.0 2023-06-25 18:37:09,527 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.71 vs. limit=15.0 2023-06-25 18:37:18,490 INFO [train.py:996] (0/4) Epoch 8, batch 18900, loss[loss=0.2333, simple_loss=0.3248, pruned_loss=0.07093, over 20916.00 frames. ], tot_loss[loss=0.2016, simple_loss=0.2768, pruned_loss=0.0632, over 4241910.96 frames. ], batch size: 607, lr: 3.70e-03, grad_scale: 16.0 2023-06-25 18:37:58,312 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.57 vs. limit=6.0 2023-06-25 18:38:09,198 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.328e+02 3.589e+02 4.833e+02 6.205e+02 1.384e+03, threshold=9.667e+02, percent-clipped=4.0 2023-06-25 18:38:13,475 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1394292.0, ans=0.0 2023-06-25 18:38:35,047 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.79 vs. limit=15.0 2023-06-25 18:39:07,634 INFO [train.py:996] (0/4) Epoch 8, batch 18950, loss[loss=0.234, simple_loss=0.3411, pruned_loss=0.06349, over 21243.00 frames. ], tot_loss[loss=0.2045, simple_loss=0.2789, pruned_loss=0.06501, over 4248461.19 frames. ], batch size: 548, lr: 3.70e-03, grad_scale: 16.0 2023-06-25 18:39:57,785 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1394592.0, ans=0.0 2023-06-25 18:40:26,088 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1394652.0, ans=0.035 2023-06-25 18:40:38,321 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1394712.0, ans=0.125 2023-06-25 18:41:08,073 INFO [train.py:996] (0/4) Epoch 8, batch 19000, loss[loss=0.2178, simple_loss=0.3105, pruned_loss=0.0626, over 21536.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2889, pruned_loss=0.06699, over 4254133.34 frames. ], batch size: 212, lr: 3.70e-03, grad_scale: 16.0 2023-06-25 18:41:39,092 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.60 vs. 
limit=10.0 2023-06-25 18:41:43,477 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1394832.0, ans=10.0 2023-06-25 18:41:47,835 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.853e+02 4.722e+02 6.033e+02 9.741e+02 2.203e+03, threshold=1.207e+03, percent-clipped=24.0 2023-06-25 18:42:04,203 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=1394952.0, ans=10.0 2023-06-25 18:42:32,043 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1395012.0, ans=0.0 2023-06-25 18:42:43,189 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 18:42:56,857 INFO [train.py:996] (0/4) Epoch 8, batch 19050, loss[loss=0.2856, simple_loss=0.3519, pruned_loss=0.1097, over 21418.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2945, pruned_loss=0.07093, over 4263374.93 frames. ], batch size: 471, lr: 3.70e-03, grad_scale: 16.0 2023-06-25 18:43:09,526 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=1395072.0, ans=10.0 2023-06-25 18:43:28,101 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1395132.0, ans=0.0 2023-06-25 18:43:48,666 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1395192.0, ans=0.5 2023-06-25 18:43:48,670 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1395192.0, ans=0.2 2023-06-25 18:44:14,826 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1395312.0, ans=0.0 2023-06-25 18:44:36,202 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1395312.0, ans=0.1 2023-06-25 18:44:44,108 INFO [train.py:996] (0/4) Epoch 8, batch 19100, loss[loss=0.2173, simple_loss=0.2832, pruned_loss=0.07573, over 21263.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2949, pruned_loss=0.07286, over 4269662.76 frames. ], batch size: 548, lr: 3.70e-03, grad_scale: 16.0 2023-06-25 18:45:00,721 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1395432.0, ans=0.2 2023-06-25 18:45:11,472 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1395432.0, ans=0.0 2023-06-25 18:45:19,083 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1395492.0, ans=0.5 2023-06-25 18:45:19,990 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.078e+02 4.021e+02 4.752e+02 6.454e+02 2.086e+03, threshold=9.504e+02, percent-clipped=4.0 2023-06-25 18:46:13,635 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.46 vs. limit=15.0 2023-06-25 18:46:30,737 INFO [train.py:996] (0/4) Epoch 8, batch 19150, loss[loss=0.2414, simple_loss=0.335, pruned_loss=0.07393, over 21735.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.2966, pruned_loss=0.07374, over 4277299.26 frames. 
], batch size: 282, lr: 3.70e-03, grad_scale: 16.0 2023-06-25 18:46:44,955 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.46 vs. limit=15.0 2023-06-25 18:46:49,927 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1395732.0, ans=0.09899494936611666 2023-06-25 18:47:31,179 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1395852.0, ans=0.0 2023-06-25 18:48:21,144 INFO [train.py:996] (0/4) Epoch 8, batch 19200, loss[loss=0.2811, simple_loss=0.379, pruned_loss=0.09162, over 21490.00 frames. ], tot_loss[loss=0.226, simple_loss=0.3052, pruned_loss=0.07343, over 4283000.19 frames. ], batch size: 471, lr: 3.70e-03, grad_scale: 32.0 2023-06-25 18:48:53,699 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.18 vs. limit=15.0 2023-06-25 18:49:00,519 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.955e+02 4.242e+02 5.606e+02 9.141e+02 1.658e+03, threshold=1.121e+03, percent-clipped=22.0 2023-06-25 18:49:17,058 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.34 vs. limit=15.0 2023-06-25 18:50:01,398 INFO [train.py:996] (0/4) Epoch 8, batch 19250, loss[loss=0.1934, simple_loss=0.2745, pruned_loss=0.0562, over 21403.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.3033, pruned_loss=0.06859, over 4279231.27 frames. ], batch size: 131, lr: 3.70e-03, grad_scale: 32.0 2023-06-25 18:51:15,281 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1396452.0, ans=0.0 2023-06-25 18:51:26,435 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.25 vs. limit=15.0 2023-06-25 18:51:35,054 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1396512.0, ans=0.07 2023-06-25 18:51:42,176 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1396512.0, ans=0.125 2023-06-25 18:51:44,986 INFO [train.py:996] (0/4) Epoch 8, batch 19300, loss[loss=0.2352, simple_loss=0.2992, pruned_loss=0.08564, over 21881.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.3002, pruned_loss=0.06843, over 4284735.40 frames. ], batch size: 124, lr: 3.70e-03, grad_scale: 32.0 2023-06-25 18:52:05,613 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 18:52:28,404 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.307e+02 3.708e+02 5.763e+02 8.363e+02 1.771e+03, threshold=1.153e+03, percent-clipped=11.0 2023-06-25 18:52:34,132 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1396692.0, ans=0.125 2023-06-25 18:53:37,202 INFO [train.py:996] (0/4) Epoch 8, batch 19350, loss[loss=0.1742, simple_loss=0.2609, pruned_loss=0.04377, over 21685.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2958, pruned_loss=0.06547, over 4283485.35 frames. 
], batch size: 247, lr: 3.70e-03, grad_scale: 32.0 2023-06-25 18:54:14,641 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1396932.0, ans=0.125 2023-06-25 18:54:20,237 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.21 vs. limit=15.0 2023-06-25 18:55:05,938 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.94 vs. limit=15.0 2023-06-25 18:55:25,382 INFO [train.py:996] (0/4) Epoch 8, batch 19400, loss[loss=0.2052, simple_loss=0.2835, pruned_loss=0.06349, over 21779.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2943, pruned_loss=0.06474, over 4283736.38 frames. ], batch size: 298, lr: 3.70e-03, grad_scale: 16.0 2023-06-25 18:55:57,787 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1397232.0, ans=0.09899494936611666 2023-06-25 18:56:04,609 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 18:56:07,070 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.659e+02 3.804e+02 4.878e+02 6.968e+02 1.951e+03, threshold=9.756e+02, percent-clipped=7.0 2023-06-25 18:56:42,859 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1397352.0, ans=0.125 2023-06-25 18:56:44,618 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1397352.0, ans=0.0 2023-06-25 18:56:48,147 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1397352.0, ans=0.125 2023-06-25 18:57:05,533 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1397412.0, ans=0.125 2023-06-25 18:57:09,335 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1397412.0, ans=0.2 2023-06-25 18:57:11,763 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.61 vs. limit=15.0 2023-06-25 18:57:13,666 INFO [train.py:996] (0/4) Epoch 8, batch 19450, loss[loss=0.2201, simple_loss=0.2786, pruned_loss=0.08081, over 21685.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2908, pruned_loss=0.06594, over 4290879.05 frames. ], batch size: 414, lr: 3.70e-03, grad_scale: 16.0 2023-06-25 18:57:19,596 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1397472.0, ans=0.125 2023-06-25 18:57:27,232 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.61 vs. 
limit=10.0 2023-06-25 18:57:35,915 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1397532.0, ans=0.1 2023-06-25 18:57:55,064 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1397592.0, ans=0.125 2023-06-25 18:58:06,725 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1397592.0, ans=0.125 2023-06-25 18:58:17,663 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1397592.0, ans=0.125 2023-06-25 18:58:35,197 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1397652.0, ans=0.2 2023-06-25 18:58:52,444 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1397712.0, ans=0.05 2023-06-25 18:58:59,099 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1397712.0, ans=0.1 2023-06-25 18:59:01,894 INFO [train.py:996] (0/4) Epoch 8, batch 19500, loss[loss=0.2316, simple_loss=0.2961, pruned_loss=0.08352, over 21856.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2872, pruned_loss=0.06699, over 4288911.79 frames. ], batch size: 98, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 18:59:18,796 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.06 vs. limit=10.0 2023-06-25 18:59:47,842 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.224e+02 4.180e+02 5.667e+02 7.986e+02 1.317e+03, threshold=1.133e+03, percent-clipped=13.0 2023-06-25 19:00:23,389 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=19.19 vs. limit=22.5 2023-06-25 19:00:42,904 INFO [train.py:996] (0/4) Epoch 8, batch 19550, loss[loss=0.1949, simple_loss=0.2963, pruned_loss=0.04671, over 21765.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2835, pruned_loss=0.06571, over 4278057.00 frames. ], batch size: 332, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:01:09,611 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.69 vs. limit=12.0 2023-06-25 19:01:44,255 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.09 vs. limit=6.0 2023-06-25 19:01:51,115 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1398252.0, ans=0.125 2023-06-25 19:01:52,575 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1398252.0, ans=0.1 2023-06-25 19:02:23,115 INFO [train.py:996] (0/4) Epoch 8, batch 19600, loss[loss=0.2186, simple_loss=0.2856, pruned_loss=0.07576, over 21951.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2856, pruned_loss=0.06683, over 4281312.74 frames. ], batch size: 316, lr: 3.69e-03, grad_scale: 32.0 2023-06-25 19:02:43,126 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.74 vs. 
limit=15.0 2023-06-25 19:02:54,769 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1398432.0, ans=0.125 2023-06-25 19:03:01,444 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1398492.0, ans=0.5 2023-06-25 19:03:10,844 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.701e+02 4.318e+02 6.046e+02 9.838e+02 1.787e+03, threshold=1.209e+03, percent-clipped=19.0 2023-06-25 19:03:11,600 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1398492.0, ans=0.0 2023-06-25 19:03:28,661 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1398552.0, ans=0.0 2023-06-25 19:03:48,544 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1398552.0, ans=0.1 2023-06-25 19:03:50,945 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=6.36 vs. limit=22.5 2023-06-25 19:03:57,439 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 19:04:10,730 INFO [train.py:996] (0/4) Epoch 8, batch 19650, loss[loss=0.2603, simple_loss=0.3379, pruned_loss=0.09138, over 21767.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2902, pruned_loss=0.06982, over 4281604.54 frames. ], batch size: 124, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:04:34,223 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1398732.0, ans=0.125 2023-06-25 19:05:36,315 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.26 vs. limit=15.0 2023-06-25 19:05:43,703 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.06 vs. limit=6.0 2023-06-25 19:06:07,316 INFO [train.py:996] (0/4) Epoch 8, batch 19700, loss[loss=0.2004, simple_loss=0.2842, pruned_loss=0.05828, over 21648.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2938, pruned_loss=0.07116, over 4278521.90 frames. ], batch size: 247, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:06:28,672 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1398972.0, ans=0.125 2023-06-25 19:07:03,159 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.726e+02 4.245e+02 5.228e+02 6.853e+02 1.147e+03, threshold=1.046e+03, percent-clipped=0.0 2023-06-25 19:07:15,887 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1399152.0, ans=0.1 2023-06-25 19:07:45,396 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1399212.0, ans=0.2 2023-06-25 19:08:01,686 INFO [train.py:996] (0/4) Epoch 8, batch 19750, loss[loss=0.2456, simple_loss=0.3398, pruned_loss=0.07569, over 21715.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.3005, pruned_loss=0.07159, over 4276085.46 frames. 
], batch size: 247, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:08:44,198 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1399332.0, ans=0.125 2023-06-25 19:08:44,233 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1399332.0, ans=0.1 2023-06-25 19:08:48,255 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.15 vs. limit=15.0 2023-06-25 19:09:44,844 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1399512.0, ans=0.2 2023-06-25 19:09:54,479 INFO [train.py:996] (0/4) Epoch 8, batch 19800, loss[loss=0.2285, simple_loss=0.2931, pruned_loss=0.08195, over 21366.00 frames. ], tot_loss[loss=0.222, simple_loss=0.3005, pruned_loss=0.0718, over 4282792.27 frames. ], batch size: 159, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:10:23,741 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.79 vs. limit=15.0 2023-06-25 19:10:34,469 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.86 vs. limit=15.0 2023-06-25 19:10:38,208 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.249e+02 4.512e+02 5.932e+02 8.767e+02 2.271e+03, threshold=1.186e+03, percent-clipped=19.0 2023-06-25 19:10:38,851 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1399692.0, ans=0.125 2023-06-25 19:11:42,800 INFO [train.py:996] (0/4) Epoch 8, batch 19850, loss[loss=0.2518, simple_loss=0.3578, pruned_loss=0.0729, over 19783.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2936, pruned_loss=0.06781, over 4281439.97 frames. ], batch size: 703, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:12:41,946 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1399992.0, ans=0.125 2023-06-25 19:12:52,387 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1400052.0, ans=0.125 2023-06-25 19:13:21,324 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1400112.0, ans=0.07 2023-06-25 19:13:22,735 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1400112.0, ans=0.125 2023-06-25 19:13:27,681 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1400172.0, ans=0.2 2023-06-25 19:13:28,608 INFO [train.py:996] (0/4) Epoch 8, batch 19900, loss[loss=0.1963, simple_loss=0.279, pruned_loss=0.0568, over 21602.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2934, pruned_loss=0.0649, over 4276992.04 frames. 
], batch size: 414, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:14:17,375 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.623e+02 3.554e+02 4.496e+02 7.903e+02 1.499e+03, threshold=8.992e+02, percent-clipped=4.0 2023-06-25 19:15:14,961 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1400412.0, ans=0.125 2023-06-25 19:15:17,769 INFO [train.py:996] (0/4) Epoch 8, batch 19950, loss[loss=0.1849, simple_loss=0.2521, pruned_loss=0.05881, over 21145.00 frames. ], tot_loss[loss=0.2091, simple_loss=0.2887, pruned_loss=0.06477, over 4271303.50 frames. ], batch size: 548, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:15:47,073 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1400532.0, ans=0.125 2023-06-25 19:16:17,877 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1400592.0, ans=0.125 2023-06-25 19:17:02,059 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1400712.0, ans=0.125 2023-06-25 19:17:12,755 INFO [train.py:996] (0/4) Epoch 8, batch 20000, loss[loss=0.2467, simple_loss=0.3206, pruned_loss=0.08643, over 21749.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2914, pruned_loss=0.0657, over 4270865.87 frames. ], batch size: 441, lr: 3.69e-03, grad_scale: 32.0 2023-06-25 19:17:16,551 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1400772.0, ans=0.0 2023-06-25 19:17:55,683 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.991e+02 3.942e+02 5.343e+02 7.186e+02 1.508e+03, threshold=1.069e+03, percent-clipped=12.0 2023-06-25 19:18:58,679 INFO [train.py:996] (0/4) Epoch 8, batch 20050, loss[loss=0.2345, simple_loss=0.2926, pruned_loss=0.08823, over 20157.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2927, pruned_loss=0.06849, over 4278147.15 frames. ], batch size: 707, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:19:11,760 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1401072.0, ans=0.125 2023-06-25 19:19:14,958 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1401132.0, ans=0.0 2023-06-25 19:19:27,405 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1401132.0, ans=0.125 2023-06-25 19:19:48,766 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1401192.0, ans=0.0 2023-06-25 19:20:16,525 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1401252.0, ans=0.1 2023-06-25 19:20:48,084 INFO [train.py:996] (0/4) Epoch 8, batch 20100, loss[loss=0.285, simple_loss=0.3776, pruned_loss=0.09618, over 21697.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2964, pruned_loss=0.07112, over 4281897.95 frames. 
], batch size: 441, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:21:19,881 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1401432.0, ans=0.125 2023-06-25 19:21:33,504 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.930e+02 3.809e+02 4.961e+02 6.304e+02 1.570e+03, threshold=9.921e+02, percent-clipped=3.0 2023-06-25 19:21:42,483 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1401492.0, ans=0.05 2023-06-25 19:22:38,498 INFO [train.py:996] (0/4) Epoch 8, batch 20150, loss[loss=0.2593, simple_loss=0.3447, pruned_loss=0.08692, over 21350.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3044, pruned_loss=0.07395, over 4278454.20 frames. ], batch size: 548, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:22:59,175 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1401672.0, ans=0.125 2023-06-25 19:23:38,655 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1401792.0, ans=0.0 2023-06-25 19:24:07,984 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1401852.0, ans=0.1 2023-06-25 19:24:15,460 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1401912.0, ans=0.1 2023-06-25 19:24:35,466 INFO [train.py:996] (0/4) Epoch 8, batch 20200, loss[loss=0.2526, simple_loss=0.3666, pruned_loss=0.06933, over 20813.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3105, pruned_loss=0.07684, over 4271254.36 frames. ], batch size: 607, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:24:59,115 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=1402032.0, ans=15.0 2023-06-25 19:25:25,537 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.344e+02 4.250e+02 5.853e+02 8.923e+02 1.822e+03, threshold=1.171e+03, percent-clipped=17.0 2023-06-25 19:25:57,518 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1402152.0, ans=0.1 2023-06-25 19:26:23,027 INFO [train.py:996] (0/4) Epoch 8, batch 20250, loss[loss=0.1915, simple_loss=0.2799, pruned_loss=0.05158, over 21770.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3092, pruned_loss=0.07469, over 4270138.85 frames. ], batch size: 247, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:26:57,639 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.59 vs. limit=15.0 2023-06-25 19:26:59,367 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.12 vs. 
limit=8.0 2023-06-25 19:27:08,874 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1402392.0, ans=0.0 2023-06-25 19:27:19,454 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1402392.0, ans=0.04949747468305833 2023-06-25 19:27:41,509 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1402452.0, ans=0.125 2023-06-25 19:28:15,776 INFO [train.py:996] (0/4) Epoch 8, batch 20300, loss[loss=0.171, simple_loss=0.2455, pruned_loss=0.04824, over 21906.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.3064, pruned_loss=0.07192, over 4276425.54 frames. ], batch size: 107, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:28:28,122 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.73 vs. limit=15.0 2023-06-25 19:28:58,617 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.664e+02 3.704e+02 5.083e+02 7.003e+02 2.093e+03, threshold=1.017e+03, percent-clipped=9.0 2023-06-25 19:29:10,630 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.73 vs. limit=15.0 2023-06-25 19:29:27,119 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1402752.0, ans=0.05 2023-06-25 19:29:56,348 INFO [train.py:996] (0/4) Epoch 8, batch 20350, loss[loss=0.2272, simple_loss=0.31, pruned_loss=0.07217, over 21430.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3055, pruned_loss=0.07164, over 4262599.59 frames. ], batch size: 131, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:30:47,445 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.50 vs. limit=22.5 2023-06-25 19:31:17,170 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1403052.0, ans=0.1 2023-06-25 19:31:24,311 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1403112.0, ans=0.1 2023-06-25 19:31:44,545 INFO [train.py:996] (0/4) Epoch 8, batch 20400, loss[loss=0.1914, simple_loss=0.2624, pruned_loss=0.06022, over 16336.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.307, pruned_loss=0.07393, over 4252472.53 frames. ], batch size: 62, lr: 3.69e-03, grad_scale: 32.0 2023-06-25 19:32:06,058 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.91 vs. 
limit=15.0 2023-06-25 19:32:19,490 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1403232.0, ans=0.2 2023-06-25 19:32:29,511 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1403292.0, ans=0.2 2023-06-25 19:32:33,893 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.980e+02 4.155e+02 6.028e+02 7.732e+02 1.561e+03, threshold=1.206e+03, percent-clipped=8.0 2023-06-25 19:32:54,231 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1403352.0, ans=0.0 2023-06-25 19:33:08,031 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1403412.0, ans=0.125 2023-06-25 19:33:31,091 INFO [train.py:996] (0/4) Epoch 8, batch 20450, loss[loss=0.2185, simple_loss=0.2875, pruned_loss=0.07471, over 21910.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.309, pruned_loss=0.07597, over 4258027.85 frames. ], batch size: 316, lr: 3.69e-03, grad_scale: 32.0 2023-06-25 19:33:38,259 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1403472.0, ans=0.125 2023-06-25 19:34:18,442 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1403592.0, ans=0.125 2023-06-25 19:34:23,404 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1403592.0, ans=0.0 2023-06-25 19:35:16,948 INFO [train.py:996] (0/4) Epoch 8, batch 20500, loss[loss=0.2303, simple_loss=0.3012, pruned_loss=0.07972, over 21760.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3049, pruned_loss=0.07633, over 4248850.68 frames. ], batch size: 124, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:36:01,586 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1403892.0, ans=0.125 2023-06-25 19:36:07,751 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.844e+02 4.072e+02 6.125e+02 8.287e+02 1.348e+03, threshold=1.225e+03, percent-clipped=6.0 2023-06-25 19:36:34,921 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1403952.0, ans=0.125 2023-06-25 19:36:57,725 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1404012.0, ans=0.125 2023-06-25 19:37:09,459 INFO [train.py:996] (0/4) Epoch 8, batch 20550, loss[loss=0.2525, simple_loss=0.3228, pruned_loss=0.09115, over 21433.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.2969, pruned_loss=0.07458, over 4244355.33 frames. ], batch size: 473, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:37:10,061 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1404072.0, ans=0.0 2023-06-25 19:37:10,142 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1404072.0, ans=0.2 2023-06-25 19:37:40,996 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 19:37:41,414 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.61 vs. 
limit=10.0 2023-06-25 19:37:48,313 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1404192.0, ans=0.0 2023-06-25 19:37:57,250 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1404192.0, ans=0.125 2023-06-25 19:38:15,070 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1404252.0, ans=0.0 2023-06-25 19:38:23,635 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 19:38:52,828 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.34 vs. limit=15.0 2023-06-25 19:38:56,968 INFO [train.py:996] (0/4) Epoch 8, batch 20600, loss[loss=0.2268, simple_loss=0.2938, pruned_loss=0.07997, over 21692.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2974, pruned_loss=0.07222, over 4229099.22 frames. ], batch size: 263, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:39:02,543 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1404372.0, ans=0.0 2023-06-25 19:39:29,535 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1404432.0, ans=0.05 2023-06-25 19:39:42,069 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.887e+02 4.920e+02 7.013e+02 1.215e+03 1.791e+03, threshold=1.403e+03, percent-clipped=24.0 2023-06-25 19:39:44,305 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1404492.0, ans=0.1 2023-06-25 19:40:42,123 INFO [train.py:996] (0/4) Epoch 8, batch 20650, loss[loss=0.2115, simple_loss=0.278, pruned_loss=0.07252, over 21558.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2949, pruned_loss=0.07289, over 4245456.46 frames. ], batch size: 548, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:41:17,579 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1404732.0, ans=0.07 2023-06-25 19:41:54,943 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=1404852.0, ans=0.025 2023-06-25 19:42:09,419 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.20 vs. limit=15.0 2023-06-25 19:42:10,640 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1404912.0, ans=0.2 2023-06-25 19:42:30,249 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1404972.0, ans=0.1 2023-06-25 19:42:31,272 INFO [train.py:996] (0/4) Epoch 8, batch 20700, loss[loss=0.1756, simple_loss=0.2453, pruned_loss=0.05293, over 21237.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2881, pruned_loss=0.06989, over 4246475.19 frames. 
], batch size: 176, lr: 3.69e-03, grad_scale: 8.0 2023-06-25 19:42:54,918 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1405032.0, ans=0.0 2023-06-25 19:43:27,033 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.481e+02 3.647e+02 4.600e+02 6.617e+02 1.302e+03, threshold=9.199e+02, percent-clipped=0.0 2023-06-25 19:44:27,720 INFO [train.py:996] (0/4) Epoch 8, batch 20750, loss[loss=0.2231, simple_loss=0.3333, pruned_loss=0.05644, over 20839.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2922, pruned_loss=0.06997, over 4248105.83 frames. ], batch size: 608, lr: 3.69e-03, grad_scale: 8.0 2023-06-25 19:44:38,148 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.32 vs. limit=10.0 2023-06-25 19:44:42,796 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1405272.0, ans=10.0 2023-06-25 19:44:47,908 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1405332.0, ans=0.125 2023-06-25 19:45:28,530 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1405392.0, ans=0.1 2023-06-25 19:46:04,359 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=1405512.0, ans=0.025 2023-06-25 19:46:16,219 INFO [train.py:996] (0/4) Epoch 8, batch 20800, loss[loss=0.1898, simple_loss=0.2571, pruned_loss=0.06123, over 21618.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.297, pruned_loss=0.07127, over 4253293.54 frames. ], batch size: 247, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 19:46:23,832 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1405572.0, ans=0.125 2023-06-25 19:46:49,926 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1405632.0, ans=0.125 2023-06-25 19:47:07,775 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1405692.0, ans=0.2 2023-06-25 19:47:10,353 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.060e+02 4.318e+02 7.506e+02 1.059e+03 2.434e+03, threshold=1.501e+03, percent-clipped=34.0 2023-06-25 19:48:02,819 INFO [train.py:996] (0/4) Epoch 8, batch 20850, loss[loss=0.1967, simple_loss=0.2679, pruned_loss=0.06273, over 21489.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2888, pruned_loss=0.06885, over 4261451.28 frames. ], batch size: 212, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 19:48:06,725 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1405872.0, ans=0.125 2023-06-25 19:48:09,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1405872.0, ans=0.1 2023-06-25 19:49:00,020 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1405992.0, ans=0.2 2023-06-25 19:49:47,593 INFO [train.py:996] (0/4) Epoch 8, batch 20900, loss[loss=0.2036, simple_loss=0.2794, pruned_loss=0.06387, over 21277.00 frames. 
], tot_loss[loss=0.214, simple_loss=0.2892, pruned_loss=0.0694, over 4271430.49 frames. ], batch size: 159, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 19:50:34,299 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.696e+02 3.719e+02 4.943e+02 7.397e+02 1.417e+03, threshold=9.886e+02, percent-clipped=0.0 2023-06-25 19:51:01,211 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1406352.0, ans=0.05 2023-06-25 19:51:30,972 INFO [train.py:996] (0/4) Epoch 8, batch 20950, loss[loss=0.1847, simple_loss=0.2662, pruned_loss=0.05161, over 21744.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.2849, pruned_loss=0.06625, over 4266187.64 frames. ], batch size: 332, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 19:51:51,813 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 19:52:14,163 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1406592.0, ans=0.1 2023-06-25 19:52:42,259 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1406652.0, ans=0.2 2023-06-25 19:53:02,358 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.44 vs. limit=15.0 2023-06-25 19:53:11,659 INFO [train.py:996] (0/4) Epoch 8, batch 21000, loss[loss=0.2332, simple_loss=0.3066, pruned_loss=0.07994, over 21919.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.2861, pruned_loss=0.06728, over 4259639.47 frames. ], batch size: 124, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 19:53:11,661 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-25 19:53:31,260 INFO [train.py:1028] (0/4) Epoch 8, validation: loss=0.2635, simple_loss=0.3595, pruned_loss=0.08373, over 1796401.00 frames. 2023-06-25 19:53:31,261 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23714MB 2023-06-25 19:54:24,844 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.517e+02 3.578e+02 4.486e+02 7.087e+02 1.717e+03, threshold=8.972e+02, percent-clipped=7.0 2023-06-25 19:54:33,576 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1406952.0, ans=0.2 2023-06-25 19:54:58,247 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1407012.0, ans=0.2 2023-06-25 19:55:10,830 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1407012.0, ans=0.0 2023-06-25 19:55:17,242 INFO [train.py:996] (0/4) Epoch 8, batch 21050, loss[loss=0.2176, simple_loss=0.2804, pruned_loss=0.07739, over 21167.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2841, pruned_loss=0.06719, over 4250284.04 frames. ], batch size: 143, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 19:56:22,316 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1407252.0, ans=0.0 2023-06-25 19:56:36,097 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1407252.0, ans=0.0 2023-06-25 19:57:05,065 INFO [train.py:996] (0/4) Epoch 8, batch 21100, loss[loss=0.1939, simple_loss=0.2599, pruned_loss=0.06393, over 21774.00 frames. 
], tot_loss[loss=0.2068, simple_loss=0.28, pruned_loss=0.0668, over 4253639.00 frames. ], batch size: 317, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 19:57:14,332 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1407372.0, ans=0.125 2023-06-25 19:57:14,387 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1407372.0, ans=0.1 2023-06-25 19:57:16,016 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1407372.0, ans=0.0 2023-06-25 19:57:16,492 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.35 vs. limit=22.5 2023-06-25 19:57:19,430 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1407372.0, ans=0.0 2023-06-25 19:57:38,360 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1407432.0, ans=0.0 2023-06-25 19:57:57,763 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.898e+02 4.201e+02 5.635e+02 7.939e+02 1.482e+03, threshold=1.127e+03, percent-clipped=15.0 2023-06-25 19:58:03,377 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1407492.0, ans=0.1 2023-06-25 19:58:03,425 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1407492.0, ans=0.0 2023-06-25 19:58:05,067 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1407552.0, ans=0.0 2023-06-25 19:58:49,892 INFO [train.py:996] (0/4) Epoch 8, batch 21150, loss[loss=0.1834, simple_loss=0.245, pruned_loss=0.06089, over 21662.00 frames. ], tot_loss[loss=0.2047, simple_loss=0.2764, pruned_loss=0.06647, over 4249502.54 frames. ], batch size: 282, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 19:59:19,532 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1407732.0, ans=0.2 2023-06-25 19:59:24,862 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1407732.0, ans=0.2 2023-06-25 19:59:52,505 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1407852.0, ans=0.125 2023-06-25 19:59:57,617 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1407852.0, ans=0.125 2023-06-25 20:00:35,568 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1407912.0, ans=0.07 2023-06-25 20:00:38,500 INFO [train.py:996] (0/4) Epoch 8, batch 21200, loss[loss=0.1791, simple_loss=0.2546, pruned_loss=0.0518, over 21594.00 frames. ], tot_loss[loss=0.2027, simple_loss=0.2729, pruned_loss=0.06625, over 4253371.58 frames. 
], batch size: 247, lr: 3.68e-03, grad_scale: 32.0 2023-06-25 20:01:02,392 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1408032.0, ans=0.0 2023-06-25 20:01:29,061 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.76 vs. limit=10.0 2023-06-25 20:01:34,476 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.940e+02 3.823e+02 4.703e+02 6.840e+02 1.518e+03, threshold=9.406e+02, percent-clipped=1.0 2023-06-25 20:02:26,001 INFO [train.py:996] (0/4) Epoch 8, batch 21250, loss[loss=0.1894, simple_loss=0.2544, pruned_loss=0.06221, over 21660.00 frames. ], tot_loss[loss=0.2014, simple_loss=0.2707, pruned_loss=0.06607, over 4253902.64 frames. ], batch size: 282, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 20:02:36,703 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1408272.0, ans=0.1 2023-06-25 20:03:10,024 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 20:03:12,274 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.00 vs. limit=12.0 2023-06-25 20:04:11,923 INFO [train.py:996] (0/4) Epoch 8, batch 21300, loss[loss=0.2329, simple_loss=0.3075, pruned_loss=0.07915, over 21878.00 frames. ], tot_loss[loss=0.2071, simple_loss=0.2774, pruned_loss=0.0684, over 4266951.76 frames. ], batch size: 415, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 20:04:14,688 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1408572.0, ans=0.1 2023-06-25 20:04:38,706 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.38 vs. limit=6.0 2023-06-25 20:04:49,975 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1408632.0, ans=0.1 2023-06-25 20:05:00,145 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1408692.0, ans=0.125 2023-06-25 20:05:07,867 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.197e+02 4.370e+02 6.934e+02 9.057e+02 1.727e+03, threshold=1.387e+03, percent-clipped=23.0 2023-06-25 20:05:27,828 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.83 vs. limit=15.0 2023-06-25 20:05:44,928 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.83 vs. limit=15.0 2023-06-25 20:05:53,332 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.26 vs. limit=12.0 2023-06-25 20:05:56,083 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1408812.0, ans=0.0 2023-06-25 20:05:58,763 INFO [train.py:996] (0/4) Epoch 8, batch 21350, loss[loss=0.1946, simple_loss=0.2692, pruned_loss=0.06, over 21347.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2817, pruned_loss=0.06933, over 4273880.42 frames. 
], batch size: 131, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 20:06:06,121 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1408872.0, ans=0.125 2023-06-25 20:06:07,970 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1408872.0, ans=0.125 2023-06-25 20:06:24,326 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.22 vs. limit=22.5 2023-06-25 20:07:10,274 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1409052.0, ans=0.125 2023-06-25 20:07:25,840 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1409052.0, ans=0.0 2023-06-25 20:07:45,971 INFO [train.py:996] (0/4) Epoch 8, batch 21400, loss[loss=0.2373, simple_loss=0.3153, pruned_loss=0.07969, over 21936.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.2844, pruned_loss=0.06809, over 4281875.35 frames. ], batch size: 316, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 20:08:46,538 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.047e+02 3.806e+02 5.030e+02 6.995e+02 1.894e+03, threshold=1.006e+03, percent-clipped=5.0 2023-06-25 20:09:08,167 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.76 vs. limit=12.0 2023-06-25 20:09:32,616 INFO [train.py:996] (0/4) Epoch 8, batch 21450, loss[loss=0.2286, simple_loss=0.3002, pruned_loss=0.0785, over 21466.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2893, pruned_loss=0.07045, over 4287807.11 frames. ], batch size: 211, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 20:10:46,572 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1409652.0, ans=0.07 2023-06-25 20:11:10,629 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1409712.0, ans=0.0 2023-06-25 20:11:19,618 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1409772.0, ans=0.0 2023-06-25 20:11:20,625 INFO [train.py:996] (0/4) Epoch 8, batch 21500, loss[loss=0.2094, simple_loss=0.2745, pruned_loss=0.07215, over 21740.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2876, pruned_loss=0.07075, over 4293072.11 frames. 
], batch size: 112, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 20:11:59,357 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1409832.0, ans=0.0 2023-06-25 20:12:17,579 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=1409892.0, ans=10.0 2023-06-25 20:12:25,074 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.002e+02 3.682e+02 4.429e+02 6.594e+02 1.934e+03, threshold=8.857e+02, percent-clipped=12.0 2023-06-25 20:12:29,096 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1409892.0, ans=0.125 2023-06-25 20:12:35,745 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1409952.0, ans=0.125 2023-06-25 20:12:56,396 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=6.73 vs. limit=22.5 2023-06-25 20:12:59,054 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1410012.0, ans=0.125 2023-06-25 20:13:05,244 INFO [train.py:996] (0/4) Epoch 8, batch 21550, loss[loss=0.1574, simple_loss=0.2254, pruned_loss=0.04466, over 21245.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2808, pruned_loss=0.06899, over 4286032.65 frames. ], batch size: 176, lr: 3.68e-03, grad_scale: 8.0 2023-06-25 20:14:43,654 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 20:14:45,471 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1410312.0, ans=0.125 2023-06-25 20:14:53,573 INFO [train.py:996] (0/4) Epoch 8, batch 21600, loss[loss=0.168, simple_loss=0.2378, pruned_loss=0.04906, over 21485.00 frames. ], tot_loss[loss=0.2044, simple_loss=0.2757, pruned_loss=0.06651, over 4283833.03 frames. ], batch size: 230, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 20:15:56,638 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.22 vs. limit=15.0 2023-06-25 20:16:02,283 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.883e+02 3.709e+02 4.996e+02 7.825e+02 2.196e+03, threshold=9.991e+02, percent-clipped=18.0 2023-06-25 20:16:10,125 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 20:16:25,483 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1410612.0, ans=0.1 2023-06-25 20:16:46,717 INFO [train.py:996] (0/4) Epoch 8, batch 21650, loss[loss=0.2214, simple_loss=0.3035, pruned_loss=0.0696, over 21229.00 frames. ], tot_loss[loss=0.2062, simple_loss=0.2819, pruned_loss=0.06525, over 4285671.99 frames. 
], batch size: 176, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 20:16:52,157 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1410672.0, ans=0.0 2023-06-25 20:17:30,587 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1410792.0, ans=0.05 2023-06-25 20:17:42,642 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1410792.0, ans=0.125 2023-06-25 20:17:59,468 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1410852.0, ans=0.125 2023-06-25 20:18:00,140 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.59 vs. limit=15.0 2023-06-25 20:18:14,778 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1410912.0, ans=0.1 2023-06-25 20:18:25,976 INFO [train.py:996] (0/4) Epoch 8, batch 21700, loss[loss=0.2005, simple_loss=0.2723, pruned_loss=0.06438, over 21719.00 frames. ], tot_loss[loss=0.2057, simple_loss=0.2828, pruned_loss=0.06428, over 4287107.88 frames. ], batch size: 282, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 20:18:44,041 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1410972.0, ans=0.2 2023-06-25 20:19:33,110 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.665e+02 3.626e+02 5.313e+02 7.928e+02 1.804e+03, threshold=1.063e+03, percent-clipped=12.0 2023-06-25 20:19:37,142 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1411152.0, ans=0.05 2023-06-25 20:19:39,401 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.14 vs. limit=15.0 2023-06-25 20:20:01,662 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1411212.0, ans=0.125 2023-06-25 20:20:12,869 INFO [train.py:996] (0/4) Epoch 8, batch 21750, loss[loss=0.189, simple_loss=0.2445, pruned_loss=0.06677, over 21196.00 frames. ], tot_loss[loss=0.204, simple_loss=0.279, pruned_loss=0.06454, over 4274176.06 frames. ], batch size: 549, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 20:20:13,604 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1411272.0, ans=0.0 2023-06-25 20:20:29,561 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=1411272.0, ans=0.5 2023-06-25 20:20:31,695 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.33 vs. limit=15.0 2023-06-25 20:20:59,530 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1411332.0, ans=0.1 2023-06-25 20:21:06,654 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.35 vs. 
limit=15.0 2023-06-25 20:21:27,107 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1411452.0, ans=0.1 2023-06-25 20:21:37,378 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1411452.0, ans=0.125 2023-06-25 20:21:39,109 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1411452.0, ans=0.125 2023-06-25 20:21:45,351 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.39 vs. limit=15.0 2023-06-25 20:21:46,888 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1411512.0, ans=0.125 2023-06-25 20:22:07,456 INFO [train.py:996] (0/4) Epoch 8, batch 21800, loss[loss=0.202, simple_loss=0.2718, pruned_loss=0.0661, over 21961.00 frames. ], tot_loss[loss=0.2034, simple_loss=0.2763, pruned_loss=0.06523, over 4278830.52 frames. ], batch size: 103, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 20:22:41,565 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1411632.0, ans=0.2 2023-06-25 20:22:59,360 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.60 vs. limit=12.0 2023-06-25 20:23:07,337 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1411692.0, ans=0.125 2023-06-25 20:23:10,287 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.928e+02 3.851e+02 5.673e+02 8.450e+02 2.187e+03, threshold=1.135e+03, percent-clipped=14.0 2023-06-25 20:23:19,869 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1411752.0, ans=0.1 2023-06-25 20:23:54,653 INFO [train.py:996] (0/4) Epoch 8, batch 21850, loss[loss=0.2261, simple_loss=0.3311, pruned_loss=0.06055, over 21605.00 frames. ], tot_loss[loss=0.2053, simple_loss=0.2799, pruned_loss=0.06532, over 4273949.09 frames. ], batch size: 389, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 20:24:03,083 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.96 vs. limit=15.0 2023-06-25 20:24:33,483 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1411932.0, ans=0.125 2023-06-25 20:25:43,127 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1412172.0, ans=0.2 2023-06-25 20:25:43,758 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.41 vs. limit=22.5 2023-06-25 20:25:44,263 INFO [train.py:996] (0/4) Epoch 8, batch 21900, loss[loss=0.2045, simple_loss=0.2746, pruned_loss=0.06726, over 21797.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2814, pruned_loss=0.06689, over 4267790.14 frames. 
], batch size: 118, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 20:26:14,222 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1412232.0, ans=0.0 2023-06-25 20:26:36,696 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1412292.0, ans=0.125 2023-06-25 20:26:41,793 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1412292.0, ans=0.125 2023-06-25 20:26:46,375 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.095e+02 4.152e+02 5.797e+02 7.520e+02 1.468e+03, threshold=1.159e+03, percent-clipped=2.0 2023-06-25 20:27:16,552 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1412412.0, ans=0.1 2023-06-25 20:27:35,947 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.92 vs. limit=6.0 2023-06-25 20:27:36,372 INFO [train.py:996] (0/4) Epoch 8, batch 21950, loss[loss=0.1921, simple_loss=0.2604, pruned_loss=0.06189, over 21584.00 frames. ], tot_loss[loss=0.2046, simple_loss=0.2766, pruned_loss=0.06629, over 4271831.26 frames. ], batch size: 263, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 20:27:43,846 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1412472.0, ans=0.125 2023-06-25 20:28:20,133 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1412592.0, ans=0.07 2023-06-25 20:28:31,284 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.50 vs. limit=22.5 2023-06-25 20:28:39,236 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1412652.0, ans=0.1 2023-06-25 20:28:54,051 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.97 vs. limit=6.0 2023-06-25 20:29:05,325 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.71 vs. limit=12.0 2023-06-25 20:29:25,471 INFO [train.py:996] (0/4) Epoch 8, batch 22000, loss[loss=0.2051, simple_loss=0.2731, pruned_loss=0.06852, over 21607.00 frames. ], tot_loss[loss=0.1987, simple_loss=0.2706, pruned_loss=0.06337, over 4266291.40 frames. ], batch size: 415, lr: 3.68e-03, grad_scale: 32.0 2023-06-25 20:30:23,053 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.352e+02 3.905e+02 5.232e+02 7.810e+02 2.335e+03, threshold=1.046e+03, percent-clipped=14.0 2023-06-25 20:30:36,587 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.68 vs. limit=15.0 2023-06-25 20:31:13,835 INFO [train.py:996] (0/4) Epoch 8, batch 22050, loss[loss=0.2349, simple_loss=0.3277, pruned_loss=0.07107, over 21783.00 frames. ], tot_loss[loss=0.2045, simple_loss=0.2778, pruned_loss=0.06562, over 4264422.67 frames. 
], batch size: 351, lr: 3.67e-03, grad_scale: 32.0 2023-06-25 20:32:52,944 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1413312.0, ans=0.1 2023-06-25 20:32:52,985 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1413312.0, ans=0.1 2023-06-25 20:33:02,604 INFO [train.py:996] (0/4) Epoch 8, batch 22100, loss[loss=0.2752, simple_loss=0.331, pruned_loss=0.1097, over 21584.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2862, pruned_loss=0.06959, over 4264539.89 frames. ], batch size: 471, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 20:33:20,097 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1413372.0, ans=0.125 2023-06-25 20:34:00,127 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.256e+02 4.552e+02 6.727e+02 1.040e+03 2.213e+03, threshold=1.345e+03, percent-clipped=23.0 2023-06-25 20:34:47,946 INFO [train.py:996] (0/4) Epoch 8, batch 22150, loss[loss=0.2095, simple_loss=0.2895, pruned_loss=0.06469, over 21418.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2888, pruned_loss=0.07062, over 4276507.61 frames. ], batch size: 131, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 20:34:55,306 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1413672.0, ans=0.125 2023-06-25 20:35:10,506 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1413672.0, ans=0.0 2023-06-25 20:35:42,265 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1413792.0, ans=0.1 2023-06-25 20:35:52,776 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.51 vs. limit=22.5 2023-06-25 20:36:35,736 INFO [train.py:996] (0/4) Epoch 8, batch 22200, loss[loss=0.231, simple_loss=0.3143, pruned_loss=0.07386, over 21257.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2922, pruned_loss=0.07193, over 4284251.31 frames. ], batch size: 159, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 20:37:06,941 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=8.25 vs. limit=12.0 2023-06-25 20:37:28,641 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1414092.0, ans=0.035 2023-06-25 20:37:29,742 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.047e+02 4.294e+02 5.583e+02 8.306e+02 1.665e+03, threshold=1.117e+03, percent-clipped=3.0 2023-06-25 20:38:05,259 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1414212.0, ans=0.125 2023-06-25 20:38:13,950 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1414212.0, ans=0.125 2023-06-25 20:38:15,680 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1414212.0, ans=0.0 2023-06-25 20:38:23,334 INFO [train.py:996] (0/4) Epoch 8, batch 22250, loss[loss=0.2704, simple_loss=0.3506, pruned_loss=0.09508, over 21822.00 frames. 
], tot_loss[loss=0.2228, simple_loss=0.299, pruned_loss=0.07331, over 4285609.25 frames. ], batch size: 118, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 20:38:34,369 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1414272.0, ans=0.04949747468305833 2023-06-25 20:38:47,778 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1414332.0, ans=0.125 2023-06-25 20:39:04,005 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.27 vs. limit=15.0 2023-06-25 20:39:07,059 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1414392.0, ans=0.125 2023-06-25 20:39:31,227 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1414452.0, ans=0.025 2023-06-25 20:39:43,123 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1414452.0, ans=0.035 2023-06-25 20:39:52,838 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1414512.0, ans=0.2 2023-06-25 20:39:54,660 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1414512.0, ans=0.125 2023-06-25 20:40:04,035 INFO [train.py:996] (0/4) Epoch 8, batch 22300, loss[loss=0.2025, simple_loss=0.272, pruned_loss=0.0665, over 21693.00 frames. ], tot_loss[loss=0.225, simple_loss=0.3, pruned_loss=0.07503, over 4291550.15 frames. ], batch size: 230, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 20:40:50,672 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1414692.0, ans=0.125 2023-06-25 20:40:57,268 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.146e+02 4.093e+02 5.360e+02 7.335e+02 1.399e+03, threshold=1.072e+03, percent-clipped=5.0 2023-06-25 20:41:36,701 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 20:41:51,928 INFO [train.py:996] (0/4) Epoch 8, batch 22350, loss[loss=0.2274, simple_loss=0.3059, pruned_loss=0.07448, over 21737.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2981, pruned_loss=0.07537, over 4298784.58 frames. ], batch size: 112, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 20:42:12,155 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1414932.0, ans=0.04949747468305833 2023-06-25 20:42:15,515 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1414932.0, ans=0.125 2023-06-25 20:42:15,594 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1414932.0, ans=0.1 2023-06-25 20:43:38,701 INFO [train.py:996] (0/4) Epoch 8, batch 22400, loss[loss=0.2298, simple_loss=0.2949, pruned_loss=0.08231, over 21510.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.295, pruned_loss=0.07295, over 4291147.83 frames. 
], batch size: 441, lr: 3.67e-03, grad_scale: 32.0 2023-06-25 20:43:51,725 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1415172.0, ans=0.05 2023-06-25 20:44:03,570 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1415232.0, ans=0.05 2023-06-25 20:44:04,038 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=1415232.0, ans=15.0 2023-06-25 20:44:34,053 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.012e+02 4.038e+02 6.138e+02 7.809e+02 1.292e+03, threshold=1.228e+03, percent-clipped=3.0 2023-06-25 20:44:34,907 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1415352.0, ans=0.05 2023-06-25 20:44:43,901 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1415352.0, ans=0.125 2023-06-25 20:45:25,837 INFO [train.py:996] (0/4) Epoch 8, batch 22450, loss[loss=0.2313, simple_loss=0.2951, pruned_loss=0.08368, over 19979.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2889, pruned_loss=0.07165, over 4284106.34 frames. ], batch size: 702, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 20:45:47,369 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1415532.0, ans=0.05 2023-06-25 20:45:49,311 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1415532.0, ans=0.1 2023-06-25 20:46:07,222 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.03 vs. limit=12.0 2023-06-25 20:46:37,832 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1415652.0, ans=0.125 2023-06-25 20:47:02,358 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1415712.0, ans=0.125 2023-06-25 20:47:12,144 INFO [train.py:996] (0/4) Epoch 8, batch 22500, loss[loss=0.1904, simple_loss=0.2676, pruned_loss=0.05659, over 21783.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2842, pruned_loss=0.07072, over 4277515.40 frames. ], batch size: 124, lr: 3.67e-03, grad_scale: 8.0 2023-06-25 20:47:16,880 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.20 vs. limit=15.0 2023-06-25 20:47:24,918 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1415772.0, ans=0.1 2023-06-25 20:47:27,148 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.90 vs. limit=6.0 2023-06-25 20:48:14,385 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.749e+02 3.949e+02 4.919e+02 7.887e+02 2.030e+03, threshold=9.838e+02, percent-clipped=13.0 2023-06-25 20:48:31,365 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-236000.pt 2023-06-25 20:48:42,889 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.92 vs. 
limit=5.0 2023-06-25 20:48:46,361 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.58 vs. limit=6.0 2023-06-25 20:49:01,280 INFO [train.py:996] (0/4) Epoch 8, batch 22550, loss[loss=0.2012, simple_loss=0.2727, pruned_loss=0.06482, over 21553.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2878, pruned_loss=0.07092, over 4278075.23 frames. ], batch size: 212, lr: 3.67e-03, grad_scale: 8.0 2023-06-25 20:50:02,712 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1416192.0, ans=0.0 2023-06-25 20:50:28,896 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.32 vs. limit=15.0 2023-06-25 20:50:52,276 INFO [train.py:996] (0/4) Epoch 8, batch 22600, loss[loss=0.1993, simple_loss=0.2852, pruned_loss=0.05668, over 21811.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2898, pruned_loss=0.07131, over 4282553.17 frames. ], batch size: 316, lr: 3.67e-03, grad_scale: 8.0 2023-06-25 20:51:28,236 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1416432.0, ans=0.125 2023-06-25 20:51:37,049 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1416492.0, ans=0.1 2023-06-25 20:52:04,649 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.075e+02 4.518e+02 6.028e+02 9.364e+02 1.882e+03, threshold=1.206e+03, percent-clipped=21.0 2023-06-25 20:52:06,085 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.40 vs. limit=22.5 2023-06-25 20:52:36,009 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1416612.0, ans=0.0 2023-06-25 20:52:38,697 INFO [train.py:996] (0/4) Epoch 8, batch 22650, loss[loss=0.206, simple_loss=0.2704, pruned_loss=0.07077, over 21870.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.287, pruned_loss=0.07067, over 4278717.00 frames. ], batch size: 98, lr: 3.67e-03, grad_scale: 8.0 2023-06-25 20:52:58,146 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1416732.0, ans=0.125 2023-06-25 20:53:28,134 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.65 vs. limit=15.0 2023-06-25 20:54:14,494 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1416912.0, ans=0.125 2023-06-25 20:54:14,507 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1416912.0, ans=0.125 2023-06-25 20:54:20,834 INFO [train.py:996] (0/4) Epoch 8, batch 22700, loss[loss=0.2179, simple_loss=0.2785, pruned_loss=0.07869, over 21347.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2808, pruned_loss=0.07012, over 4271539.27 frames. 
], batch size: 160, lr: 3.67e-03, grad_scale: 8.0 2023-06-25 20:54:30,454 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1416972.0, ans=0.125 2023-06-25 20:54:35,451 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1416972.0, ans=0.125 2023-06-25 20:55:33,799 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.950e+02 3.999e+02 5.550e+02 8.694e+02 1.659e+03, threshold=1.110e+03, percent-clipped=6.0 2023-06-25 20:55:45,340 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1417152.0, ans=0.0 2023-06-25 20:56:08,928 INFO [train.py:996] (0/4) Epoch 8, batch 22750, loss[loss=0.2181, simple_loss=0.3197, pruned_loss=0.05826, over 19748.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2825, pruned_loss=0.07166, over 4278374.94 frames. ], batch size: 703, lr: 3.67e-03, grad_scale: 8.0 2023-06-25 20:57:25,572 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1417452.0, ans=0.125 2023-06-25 20:57:32,984 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.76 vs. limit=15.0 2023-06-25 20:57:38,970 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 20:57:50,900 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1417512.0, ans=0.0 2023-06-25 20:57:55,438 INFO [train.py:996] (0/4) Epoch 8, batch 22800, loss[loss=0.21, simple_loss=0.2768, pruned_loss=0.07155, over 21677.00 frames. ], tot_loss[loss=0.218, simple_loss=0.288, pruned_loss=0.07404, over 4287955.40 frames. ], batch size: 263, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 20:58:02,668 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1417572.0, ans=0.0 2023-06-25 20:58:59,230 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1417692.0, ans=0.2 2023-06-25 20:59:06,885 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.216e+02 4.609e+02 5.638e+02 8.633e+02 1.980e+03, threshold=1.128e+03, percent-clipped=10.0 2023-06-25 20:59:41,020 INFO [train.py:996] (0/4) Epoch 8, batch 22850, loss[loss=0.1947, simple_loss=0.2589, pruned_loss=0.06527, over 21380.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.285, pruned_loss=0.07336, over 4292941.84 frames. 
], batch size: 144, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 20:59:51,887 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1417872.0, ans=0.2 2023-06-25 21:00:53,604 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1418052.0, ans=0.125 2023-06-25 21:01:12,022 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1418112.0, ans=0.125 2023-06-25 21:01:22,736 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1418112.0, ans=0.125 2023-06-25 21:01:23,331 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.63 vs. limit=15.0 2023-06-25 21:01:30,633 INFO [train.py:996] (0/4) Epoch 8, batch 22900, loss[loss=0.2308, simple_loss=0.3468, pruned_loss=0.05743, over 21194.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.288, pruned_loss=0.07271, over 4284706.80 frames. ], batch size: 548, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 21:02:04,138 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1418232.0, ans=0.0 2023-06-25 21:02:20,347 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1418232.0, ans=0.125 2023-06-25 21:02:28,887 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1418292.0, ans=0.0 2023-06-25 21:02:45,754 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.594e+02 4.620e+02 6.877e+02 1.071e+03 2.318e+03, threshold=1.375e+03, percent-clipped=23.0 2023-06-25 21:03:00,316 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 21:03:04,429 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.97 vs. limit=15.0 2023-06-25 21:03:25,359 INFO [train.py:996] (0/4) Epoch 8, batch 22950, loss[loss=0.252, simple_loss=0.3832, pruned_loss=0.0604, over 21339.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.3001, pruned_loss=0.07135, over 4276330.05 frames. ], batch size: 548, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 21:04:16,989 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1418592.0, ans=0.0 2023-06-25 21:05:00,714 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1418712.0, ans=0.125 2023-06-25 21:05:12,931 INFO [train.py:996] (0/4) Epoch 8, batch 23000, loss[loss=0.2137, simple_loss=0.3093, pruned_loss=0.05908, over 20036.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.3009, pruned_loss=0.07009, over 4279847.56 frames. 
], batch size: 703, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 21:06:07,832 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1418892.0, ans=0.2 2023-06-25 21:06:10,511 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.651e+02 4.060e+02 5.403e+02 8.584e+02 1.736e+03, threshold=1.081e+03, percent-clipped=10.0 2023-06-25 21:06:55,857 INFO [train.py:996] (0/4) Epoch 8, batch 23050, loss[loss=0.2224, simple_loss=0.3025, pruned_loss=0.07116, over 20798.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.3013, pruned_loss=0.07152, over 4272627.87 frames. ], batch size: 607, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 21:06:58,307 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1419072.0, ans=0.125 2023-06-25 21:07:00,870 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=1419072.0, ans=6.0 2023-06-25 21:07:13,975 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1419072.0, ans=0.125 2023-06-25 21:07:19,780 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.11 vs. limit=15.0 2023-06-25 21:07:31,199 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1419132.0, ans=0.125 2023-06-25 21:07:36,542 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1419132.0, ans=0.2 2023-06-25 21:07:50,162 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1419192.0, ans=0.125 2023-06-25 21:07:50,244 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1419192.0, ans=0.1 2023-06-25 21:08:30,140 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.87 vs. limit=15.0 2023-06-25 21:08:38,086 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1419312.0, ans=0.1 2023-06-25 21:08:42,769 INFO [train.py:996] (0/4) Epoch 8, batch 23100, loss[loss=0.1846, simple_loss=0.2479, pruned_loss=0.0607, over 21597.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2977, pruned_loss=0.07289, over 4261665.08 frames. ], batch size: 231, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 21:08:44,217 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.25 vs. limit=15.0 2023-06-25 21:09:06,256 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.25 vs. 
limit=12.0 2023-06-25 21:09:44,365 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.818e+02 4.180e+02 5.701e+02 8.990e+02 1.720e+03, threshold=1.140e+03, percent-clipped=10.0 2023-06-25 21:10:04,871 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1419552.0, ans=0.125 2023-06-25 21:10:30,272 INFO [train.py:996] (0/4) Epoch 8, batch 23150, loss[loss=0.1977, simple_loss=0.2671, pruned_loss=0.06414, over 21200.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2915, pruned_loss=0.07209, over 4259177.22 frames. ], batch size: 548, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 21:11:00,404 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1419732.0, ans=0.125 2023-06-25 21:11:19,815 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.45 vs. limit=15.0 2023-06-25 21:11:52,984 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1419852.0, ans=0.125 2023-06-25 21:12:17,936 INFO [train.py:996] (0/4) Epoch 8, batch 23200, loss[loss=0.2177, simple_loss=0.3002, pruned_loss=0.06758, over 21372.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2888, pruned_loss=0.07185, over 4262079.92 frames. ], batch size: 131, lr: 3.67e-03, grad_scale: 32.0 2023-06-25 21:12:44,167 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1420032.0, ans=0.2 2023-06-25 21:13:19,456 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.122e+02 4.151e+02 5.652e+02 8.200e+02 1.593e+03, threshold=1.130e+03, percent-clipped=6.0 2023-06-25 21:13:41,735 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1420212.0, ans=0.0 2023-06-25 21:13:59,487 INFO [train.py:996] (0/4) Epoch 8, batch 23250, loss[loss=0.2331, simple_loss=0.2969, pruned_loss=0.08462, over 21859.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2895, pruned_loss=0.07286, over 4274712.46 frames. ], batch size: 414, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 21:14:18,584 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1420272.0, ans=0.05 2023-06-25 21:14:57,255 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.88 vs. limit=22.5 2023-06-25 21:15:00,418 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1420452.0, ans=0.04949747468305833 2023-06-25 21:15:25,653 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1420512.0, ans=0.0 2023-06-25 21:15:52,824 INFO [train.py:996] (0/4) Epoch 8, batch 23300, loss[loss=0.3142, simple_loss=0.415, pruned_loss=0.1066, over 21527.00 frames. ], tot_loss[loss=0.224, simple_loss=0.2988, pruned_loss=0.07461, over 4283192.37 frames. 
], batch size: 471, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 21:16:03,977 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1420572.0, ans=0.2 2023-06-25 21:16:09,437 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1420632.0, ans=0.2 2023-06-25 21:16:57,818 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.217e+02 4.429e+02 5.607e+02 7.442e+02 1.718e+03, threshold=1.121e+03, percent-clipped=5.0 2023-06-25 21:17:23,991 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.56 vs. limit=15.0 2023-06-25 21:17:35,087 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1420812.0, ans=0.125 2023-06-25 21:17:41,346 INFO [train.py:996] (0/4) Epoch 8, batch 23350, loss[loss=0.1734, simple_loss=0.2599, pruned_loss=0.04338, over 21686.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.3018, pruned_loss=0.07337, over 4275380.50 frames. ], batch size: 298, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:17:47,352 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1420872.0, ans=0.1 2023-06-25 21:18:07,006 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1420932.0, ans=0.125 2023-06-25 21:18:08,376 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1420932.0, ans=0.125 2023-06-25 21:18:10,727 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.90 vs. limit=15.0 2023-06-25 21:18:37,453 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1420992.0, ans=0.125 2023-06-25 21:18:55,867 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1421052.0, ans=0.0 2023-06-25 21:19:29,497 INFO [train.py:996] (0/4) Epoch 8, batch 23400, loss[loss=0.2041, simple_loss=0.2765, pruned_loss=0.06586, over 21478.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2952, pruned_loss=0.06955, over 4283872.80 frames. ], batch size: 211, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:19:39,310 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.66 vs. limit=22.5 2023-06-25 21:19:54,349 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1421232.0, ans=0.125 2023-06-25 21:19:54,352 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1421232.0, ans=0.2 2023-06-25 21:20:34,186 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.708e+02 4.466e+02 6.262e+02 8.598e+02 1.529e+03, threshold=1.252e+03, percent-clipped=12.0 2023-06-25 21:20:36,797 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1421352.0, ans=0.0 2023-06-25 21:21:17,392 INFO [train.py:996] (0/4) Epoch 8, batch 23450, loss[loss=0.2691, simple_loss=0.338, pruned_loss=0.1001, over 21350.00 frames. 
], tot_loss[loss=0.2209, simple_loss=0.2968, pruned_loss=0.07254, over 4284049.61 frames. ], batch size: 143, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:21:40,488 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1421532.0, ans=0.0 2023-06-25 21:21:43,529 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1421532.0, ans=0.125 2023-06-25 21:22:09,815 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.04 vs. limit=6.0 2023-06-25 21:22:23,737 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1421652.0, ans=0.04949747468305833 2023-06-25 21:23:04,844 INFO [train.py:996] (0/4) Epoch 8, batch 23500, loss[loss=0.2133, simple_loss=0.2851, pruned_loss=0.07073, over 21789.00 frames. ], tot_loss[loss=0.224, simple_loss=0.2989, pruned_loss=0.07458, over 4282219.94 frames. ], batch size: 389, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:23:25,668 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 21:23:55,280 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.55 vs. limit=15.0 2023-06-25 21:24:05,034 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1421952.0, ans=0.125 2023-06-25 21:24:07,690 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.437e+02 4.437e+02 5.920e+02 8.678e+02 1.556e+03, threshold=1.184e+03, percent-clipped=4.0 2023-06-25 21:24:09,951 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1421952.0, ans=0.2 2023-06-25 21:24:43,688 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.98 vs. limit=10.0 2023-06-25 21:24:50,817 INFO [train.py:996] (0/4) Epoch 8, batch 23550, loss[loss=0.2309, simple_loss=0.2797, pruned_loss=0.09105, over 21382.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.2948, pruned_loss=0.07398, over 4265280.27 frames. ], batch size: 473, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:24:58,013 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1422072.0, ans=0.125 2023-06-25 21:25:38,523 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1422192.0, ans=0.0 2023-06-25 21:26:14,056 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.12 vs. limit=15.0 2023-06-25 21:26:34,228 INFO [train.py:996] (0/4) Epoch 8, batch 23600, loss[loss=0.2532, simple_loss=0.3248, pruned_loss=0.09081, over 21870.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.294, pruned_loss=0.07427, over 4258297.64 frames. 
], batch size: 371, lr: 3.66e-03, grad_scale: 32.0 2023-06-25 21:27:26,596 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1422492.0, ans=0.125 2023-06-25 21:27:43,436 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.whiten.whitening_limit, batch_count=1422552.0, ans=12.0 2023-06-25 21:27:45,449 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.060e+02 4.385e+02 5.770e+02 8.074e+02 1.431e+03, threshold=1.154e+03, percent-clipped=6.0 2023-06-25 21:27:51,704 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1422552.0, ans=0.0 2023-06-25 21:28:17,981 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1422672.0, ans=0.125 2023-06-25 21:28:19,022 INFO [train.py:996] (0/4) Epoch 8, batch 23650, loss[loss=0.2016, simple_loss=0.2796, pruned_loss=0.06183, over 20102.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2941, pruned_loss=0.07216, over 4260331.65 frames. ], batch size: 702, lr: 3.66e-03, grad_scale: 32.0 2023-06-25 21:28:43,028 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1422672.0, ans=0.2 2023-06-25 21:28:44,643 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1422672.0, ans=0.0 2023-06-25 21:30:12,788 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1422912.0, ans=10.0 2023-06-25 21:30:15,686 INFO [train.py:996] (0/4) Epoch 8, batch 23700, loss[loss=0.2256, simple_loss=0.3018, pruned_loss=0.07467, over 21401.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2966, pruned_loss=0.07196, over 4267911.84 frames. ], batch size: 176, lr: 3.66e-03, grad_scale: 32.0 2023-06-25 21:31:01,412 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1423092.0, ans=0.0 2023-06-25 21:31:21,928 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.069e+02 4.706e+02 7.567e+02 1.059e+03 2.312e+03, threshold=1.513e+03, percent-clipped=21.0 2023-06-25 21:31:51,078 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1423212.0, ans=0.1 2023-06-25 21:31:57,860 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1423212.0, ans=0.0 2023-06-25 21:32:05,919 INFO [train.py:996] (0/4) Epoch 8, batch 23750, loss[loss=0.1983, simple_loss=0.3044, pruned_loss=0.04607, over 21628.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.298, pruned_loss=0.07178, over 4264139.87 frames. ], batch size: 414, lr: 3.66e-03, grad_scale: 32.0 2023-06-25 21:33:54,140 INFO [train.py:996] (0/4) Epoch 8, batch 23800, loss[loss=0.2016, simple_loss=0.281, pruned_loss=0.06111, over 21366.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2962, pruned_loss=0.06945, over 4266379.96 frames. 
], batch size: 131, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:34:06,646 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1423572.0, ans=0.125 2023-06-25 21:34:59,936 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1423692.0, ans=0.05 2023-06-25 21:35:08,036 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.992e+02 4.494e+02 6.635e+02 8.945e+02 1.790e+03, threshold=1.327e+03, percent-clipped=2.0 2023-06-25 21:35:19,046 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1423752.0, ans=0.1 2023-06-25 21:35:29,968 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1423812.0, ans=0.125 2023-06-25 21:35:50,938 INFO [train.py:996] (0/4) Epoch 8, batch 23850, loss[loss=0.239, simple_loss=0.3145, pruned_loss=0.0817, over 21447.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.3037, pruned_loss=0.07185, over 4267989.15 frames. ], batch size: 549, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:36:09,792 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1423932.0, ans=0.2 2023-06-25 21:36:13,257 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1423932.0, ans=0.0 2023-06-25 21:37:40,723 INFO [train.py:996] (0/4) Epoch 8, batch 23900, loss[loss=0.2214, simple_loss=0.3099, pruned_loss=0.06641, over 21655.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3092, pruned_loss=0.07339, over 4270085.18 frames. ], batch size: 332, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:37:44,997 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 21:38:26,414 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1424292.0, ans=0.125 2023-06-25 21:38:27,134 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.84 vs. limit=15.0 2023-06-25 21:38:41,566 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.117e+02 4.954e+02 6.480e+02 8.834e+02 1.664e+03, threshold=1.296e+03, percent-clipped=3.0 2023-06-25 21:38:59,498 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1424412.0, ans=0.125 2023-06-25 21:39:23,118 INFO [train.py:996] (0/4) Epoch 8, batch 23950, loss[loss=0.2169, simple_loss=0.2894, pruned_loss=0.07223, over 21933.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.3041, pruned_loss=0.07274, over 4272360.66 frames. 
], batch size: 317, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:39:34,313 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1424472.0, ans=0.0 2023-06-25 21:39:35,901 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1424472.0, ans=0.0 2023-06-25 21:39:39,896 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1424532.0, ans=0.2 2023-06-25 21:40:07,601 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1424592.0, ans=0.1 2023-06-25 21:40:21,293 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1424592.0, ans=0.025 2023-06-25 21:40:22,807 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1424592.0, ans=0.1 2023-06-25 21:40:40,466 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1424652.0, ans=0.125 2023-06-25 21:41:08,045 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1424712.0, ans=0.0 2023-06-25 21:41:11,187 INFO [train.py:996] (0/4) Epoch 8, batch 24000, loss[loss=0.2496, simple_loss=0.3607, pruned_loss=0.06921, over 19889.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3043, pruned_loss=0.07456, over 4267746.39 frames. ], batch size: 703, lr: 3.66e-03, grad_scale: 32.0 2023-06-25 21:41:11,188 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-25 21:41:29,306 INFO [train.py:1028] (0/4) Epoch 8, validation: loss=0.2655, simple_loss=0.3581, pruned_loss=0.0864, over 1796401.00 frames. 2023-06-25 21:41:29,307 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23714MB 2023-06-25 21:42:49,032 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.318e+02 4.591e+02 6.093e+02 8.134e+02 1.870e+03, threshold=1.219e+03, percent-clipped=5.0 2023-06-25 21:43:18,438 INFO [train.py:996] (0/4) Epoch 8, batch 24050, loss[loss=0.2161, simple_loss=0.3062, pruned_loss=0.06301, over 21843.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3064, pruned_loss=0.07544, over 4266563.65 frames. ], batch size: 371, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:43:55,844 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1425132.0, ans=0.1 2023-06-25 21:45:14,058 INFO [train.py:996] (0/4) Epoch 8, batch 24100, loss[loss=0.2402, simple_loss=0.3204, pruned_loss=0.08001, over 21261.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.307, pruned_loss=0.07474, over 4270810.13 frames. ], batch size: 548, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:45:34,369 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1425372.0, ans=0.125 2023-06-25 21:45:35,035 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.54 vs. 
limit=15.0 2023-06-25 21:45:56,576 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=1425492.0, ans=10.0 2023-06-25 21:46:03,612 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1425492.0, ans=0.125 2023-06-25 21:46:14,059 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.11 vs. limit=22.5 2023-06-25 21:46:27,131 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.209e+02 4.362e+02 5.817e+02 7.695e+02 1.790e+03, threshold=1.163e+03, percent-clipped=6.0 2023-06-25 21:46:31,945 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1425552.0, ans=0.0 2023-06-25 21:47:02,462 INFO [train.py:996] (0/4) Epoch 8, batch 24150, loss[loss=0.1905, simple_loss=0.2404, pruned_loss=0.07031, over 20336.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3052, pruned_loss=0.07566, over 4270630.79 frames. ], batch size: 703, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:47:19,646 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1425672.0, ans=0.1 2023-06-25 21:48:14,865 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.27 vs. limit=22.5 2023-06-25 21:48:23,474 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.20 vs. limit=10.0 2023-06-25 21:48:58,699 INFO [train.py:996] (0/4) Epoch 8, batch 24200, loss[loss=0.2254, simple_loss=0.3117, pruned_loss=0.0695, over 21730.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.309, pruned_loss=0.07767, over 4279439.63 frames. ], batch size: 298, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:49:38,605 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1426032.0, ans=0.05 2023-06-25 21:49:40,162 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1426032.0, ans=0.0 2023-06-25 21:49:49,197 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1426092.0, ans=0.0 2023-06-25 21:49:58,280 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1426092.0, ans=0.5 2023-06-25 21:50:13,420 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.946e+02 4.269e+02 5.400e+02 8.843e+02 1.561e+03, threshold=1.080e+03, percent-clipped=7.0 2023-06-25 21:50:17,534 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1426152.0, ans=0.0 2023-06-25 21:50:18,149 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.40 vs. limit=15.0 2023-06-25 21:50:37,619 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1426212.0, ans=0.1 2023-06-25 21:50:49,359 INFO [train.py:996] (0/4) Epoch 8, batch 24250, loss[loss=0.1715, simple_loss=0.274, pruned_loss=0.03453, over 21827.00 frames. 
], tot_loss[loss=0.2256, simple_loss=0.3055, pruned_loss=0.0729, over 4269084.03 frames. ], batch size: 316, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:52:17,366 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.82 vs. limit=6.0 2023-06-25 21:52:36,542 INFO [train.py:996] (0/4) Epoch 8, batch 24300, loss[loss=0.1788, simple_loss=0.2661, pruned_loss=0.04577, over 21809.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2989, pruned_loss=0.06745, over 4275791.31 frames. ], batch size: 316, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:53:11,714 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.11 vs. limit=15.0 2023-06-25 21:53:48,816 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.598e+02 3.813e+02 5.438e+02 8.323e+02 1.746e+03, threshold=1.088e+03, percent-clipped=13.0 2023-06-25 21:54:11,039 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1426812.0, ans=0.125 2023-06-25 21:54:21,217 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1426812.0, ans=0.125 2023-06-25 21:54:29,437 INFO [train.py:996] (0/4) Epoch 8, batch 24350, loss[loss=0.2628, simple_loss=0.3419, pruned_loss=0.09184, over 21788.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.296, pruned_loss=0.06761, over 4283280.83 frames. ], batch size: 124, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:54:39,190 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.26 vs. limit=15.0 2023-06-25 21:54:41,131 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.98 vs. limit=12.0 2023-06-25 21:55:16,223 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1426992.0, ans=0.0 2023-06-25 21:55:17,753 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1426992.0, ans=0.2 2023-06-25 21:55:22,972 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1426992.0, ans=0.1 2023-06-25 21:55:37,942 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.41 vs. limit=15.0 2023-06-25 21:56:18,889 INFO [train.py:996] (0/4) Epoch 8, batch 24400, loss[loss=0.2259, simple_loss=0.3167, pruned_loss=0.06755, over 17188.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.3003, pruned_loss=0.07043, over 4277847.33 frames. 
], batch size: 60, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:56:44,659 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1427232.0, ans=0.1 2023-06-25 21:57:04,156 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1427292.0, ans=0.125 2023-06-25 21:57:29,771 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.945e+02 4.612e+02 5.722e+02 8.222e+02 2.006e+03, threshold=1.144e+03, percent-clipped=13.0 2023-06-25 21:58:07,731 INFO [train.py:996] (0/4) Epoch 8, batch 24450, loss[loss=0.2126, simple_loss=0.2825, pruned_loss=0.07138, over 21721.00 frames. ], tot_loss[loss=0.222, simple_loss=0.301, pruned_loss=0.0715, over 4272370.66 frames. ], batch size: 333, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:58:10,141 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1427472.0, ans=0.2 2023-06-25 21:58:48,021 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.57 vs. limit=22.5 2023-06-25 21:59:08,387 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.99 vs. limit=15.0 2023-06-25 21:59:16,664 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1427652.0, ans=0.125 2023-06-25 21:59:18,409 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1427652.0, ans=0.0 2023-06-25 21:59:18,424 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1427652.0, ans=0.2 2023-06-25 21:59:23,641 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=1427652.0, ans=10.0 2023-06-25 21:59:50,822 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1427712.0, ans=0.025 2023-06-25 21:59:55,381 INFO [train.py:996] (0/4) Epoch 8, batch 24500, loss[loss=0.1886, simple_loss=0.2752, pruned_loss=0.05104, over 21790.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.3025, pruned_loss=0.07187, over 4280208.13 frames. ], batch size: 247, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:59:56,102 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1427772.0, ans=0.0 2023-06-25 22:00:51,813 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.06 vs. limit=6.0 2023-06-25 22:01:04,623 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.043e+02 4.093e+02 5.380e+02 7.688e+02 2.312e+03, threshold=1.076e+03, percent-clipped=10.0 2023-06-25 22:01:47,746 INFO [train.py:996] (0/4) Epoch 8, batch 24550, loss[loss=0.2272, simple_loss=0.3056, pruned_loss=0.07445, over 21734.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3054, pruned_loss=0.07369, over 4288344.88 frames. 
], batch size: 332, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 22:01:53,647 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1428072.0, ans=0.125 2023-06-25 22:03:27,157 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1428312.0, ans=0.125 2023-06-25 22:03:30,475 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1428312.0, ans=0.1 2023-06-25 22:03:33,882 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1428372.0, ans=0.0 2023-06-25 22:03:34,844 INFO [train.py:996] (0/4) Epoch 8, batch 24600, loss[loss=0.2034, simple_loss=0.2654, pruned_loss=0.07077, over 21607.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.3016, pruned_loss=0.07385, over 4287107.77 frames. ], batch size: 231, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 22:03:50,603 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1428432.0, ans=0.0 2023-06-25 22:03:52,202 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1428432.0, ans=0.125 2023-06-25 22:04:43,326 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.133e+02 4.316e+02 5.425e+02 7.027e+02 1.651e+03, threshold=1.085e+03, percent-clipped=8.0 2023-06-25 22:05:21,821 INFO [train.py:996] (0/4) Epoch 8, batch 24650, loss[loss=0.1889, simple_loss=0.259, pruned_loss=0.05939, over 21472.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2946, pruned_loss=0.07288, over 4277291.29 frames. ], batch size: 195, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:05:26,018 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1428672.0, ans=0.1 2023-06-25 22:05:56,976 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1428732.0, ans=0.0 2023-06-25 22:06:31,338 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1428852.0, ans=0.125 2023-06-25 22:07:04,948 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1428912.0, ans=0.125 2023-06-25 22:07:07,949 INFO [train.py:996] (0/4) Epoch 8, batch 24700, loss[loss=0.2516, simple_loss=0.2986, pruned_loss=0.1022, over 21438.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2922, pruned_loss=0.07145, over 4272458.52 frames. 
], batch size: 509, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:07:17,189 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1428972.0, ans=0.0 2023-06-25 22:07:40,355 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1429032.0, ans=0.125 2023-06-25 22:07:47,113 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1429092.0, ans=0.125 2023-06-25 22:08:17,100 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.664e+02 4.405e+02 6.289e+02 8.929e+02 2.025e+03, threshold=1.258e+03, percent-clipped=12.0 2023-06-25 22:08:49,442 INFO [train.py:996] (0/4) Epoch 8, batch 24750, loss[loss=0.2049, simple_loss=0.2652, pruned_loss=0.07232, over 14979.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.286, pruned_loss=0.06849, over 4269536.84 frames. ], batch size: 62, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:08:50,108 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1429272.0, ans=0.125 2023-06-25 22:09:04,089 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1429272.0, ans=0.125 2023-06-25 22:09:12,939 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1429332.0, ans=0.2 2023-06-25 22:09:15,166 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.95 vs. limit=22.5 2023-06-25 22:09:37,622 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1429392.0, ans=0.09899494936611666 2023-06-25 22:10:21,160 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1429512.0, ans=0.125 2023-06-25 22:10:37,858 INFO [train.py:996] (0/4) Epoch 8, batch 24800, loss[loss=0.2328, simple_loss=0.2882, pruned_loss=0.08867, over 21614.00 frames. ], tot_loss[loss=0.2074, simple_loss=0.2797, pruned_loss=0.06761, over 4265000.14 frames. ], batch size: 441, lr: 3.65e-03, grad_scale: 32.0 2023-06-25 22:10:38,523 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1429572.0, ans=0.125 2023-06-25 22:10:40,233 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1429572.0, ans=0.035 2023-06-25 22:11:49,246 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.829e+02 4.217e+02 5.954e+02 8.314e+02 1.595e+03, threshold=1.191e+03, percent-clipped=9.0 2023-06-25 22:12:20,371 INFO [train.py:996] (0/4) Epoch 8, batch 24850, loss[loss=0.2071, simple_loss=0.2752, pruned_loss=0.06951, over 21157.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.2793, pruned_loss=0.06863, over 4267914.38 frames. ], batch size: 608, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:12:28,883 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.89 vs. 
limit=10.0 2023-06-25 22:12:33,246 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1429872.0, ans=0.1 2023-06-25 22:12:34,902 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1429872.0, ans=0.07 2023-06-25 22:14:01,774 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1430112.0, ans=0.5 2023-06-25 22:14:09,952 INFO [train.py:996] (0/4) Epoch 8, batch 24900, loss[loss=0.189, simple_loss=0.259, pruned_loss=0.05949, over 21718.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2822, pruned_loss=0.06947, over 4270494.44 frames. ], batch size: 247, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:14:17,823 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1430172.0, ans=0.05 2023-06-25 22:14:21,280 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1430172.0, ans=0.125 2023-06-25 22:15:28,744 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1430352.0, ans=0.125 2023-06-25 22:15:31,589 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.069e+02 4.057e+02 5.546e+02 7.694e+02 2.051e+03, threshold=1.109e+03, percent-clipped=6.0 2023-06-25 22:15:58,286 INFO [train.py:996] (0/4) Epoch 8, batch 24950, loss[loss=0.1897, simple_loss=0.2318, pruned_loss=0.0738, over 20322.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2896, pruned_loss=0.07312, over 4274523.08 frames. ], batch size: 703, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:17:01,710 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.46 vs. limit=15.0 2023-06-25 22:17:07,925 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1430592.0, ans=0.2 2023-06-25 22:17:08,546 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.33 vs. limit=10.0 2023-06-25 22:17:32,321 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1430712.0, ans=0.1 2023-06-25 22:17:38,844 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1430712.0, ans=0.125 2023-06-25 22:17:46,834 INFO [train.py:996] (0/4) Epoch 8, batch 25000, loss[loss=0.2363, simple_loss=0.3373, pruned_loss=0.06767, over 16681.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.2962, pruned_loss=0.0754, over 4269368.48 frames. ], batch size: 60, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:17:57,335 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.44 vs. 
limit=22.5 2023-06-25 22:18:08,819 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1430832.0, ans=0.125 2023-06-25 22:18:37,746 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1430832.0, ans=0.125 2023-06-25 22:18:56,300 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1430892.0, ans=0.05 2023-06-25 22:19:03,042 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1430952.0, ans=0.0 2023-06-25 22:19:03,761 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.87 vs. limit=22.5 2023-06-25 22:19:07,076 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.48 vs. limit=15.0 2023-06-25 22:19:07,611 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.298e+02 4.363e+02 6.743e+02 9.687e+02 1.962e+03, threshold=1.349e+03, percent-clipped=15.0 2023-06-25 22:19:32,749 INFO [train.py:996] (0/4) Epoch 8, batch 25050, loss[loss=0.1802, simple_loss=0.2394, pruned_loss=0.06055, over 21223.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2888, pruned_loss=0.07317, over 4277305.46 frames. ], batch size: 549, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:20:01,676 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.84 vs. limit=15.0 2023-06-25 22:20:52,733 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1431252.0, ans=0.0 2023-06-25 22:21:19,836 INFO [train.py:996] (0/4) Epoch 8, batch 25100, loss[loss=0.2068, simple_loss=0.2721, pruned_loss=0.07075, over 21554.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2832, pruned_loss=0.0717, over 4275826.60 frames. ], batch size: 247, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:22:00,562 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1431432.0, ans=0.0 2023-06-25 22:22:28,587 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1431492.0, ans=0.125 2023-06-25 22:22:34,262 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.54 vs. limit=15.0 2023-06-25 22:22:41,589 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.039e+02 4.362e+02 5.445e+02 8.840e+02 1.769e+03, threshold=1.089e+03, percent-clipped=5.0 2023-06-25 22:23:07,176 INFO [train.py:996] (0/4) Epoch 8, batch 25150, loss[loss=0.2052, simple_loss=0.2959, pruned_loss=0.05722, over 21854.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2879, pruned_loss=0.07019, over 4263575.07 frames. ], batch size: 118, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:23:41,213 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1431732.0, ans=0.0 2023-06-25 22:23:58,949 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.83 vs. 
limit=15.0 2023-06-25 22:24:10,815 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1431792.0, ans=0.0 2023-06-25 22:24:30,331 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1431852.0, ans=0.2 2023-06-25 22:24:55,115 INFO [train.py:996] (0/4) Epoch 8, batch 25200, loss[loss=0.1966, simple_loss=0.27, pruned_loss=0.06161, over 21915.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2881, pruned_loss=0.06836, over 4271551.26 frames. ], batch size: 107, lr: 3.65e-03, grad_scale: 32.0 2023-06-25 22:26:02,169 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1432092.0, ans=0.125 2023-06-25 22:26:14,826 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.15 vs. limit=15.0 2023-06-25 22:26:18,385 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.791e+02 3.750e+02 5.347e+02 7.396e+02 1.859e+03, threshold=1.069e+03, percent-clipped=8.0 2023-06-25 22:26:41,770 INFO [train.py:996] (0/4) Epoch 8, batch 25250, loss[loss=0.2111, simple_loss=0.2756, pruned_loss=0.07331, over 21868.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.286, pruned_loss=0.0674, over 4267872.13 frames. ], batch size: 98, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:26:45,908 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1432272.0, ans=0.125 2023-06-25 22:27:17,650 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1432332.0, ans=0.2 2023-06-25 22:28:02,566 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.07 vs. limit=15.0 2023-06-25 22:28:03,849 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1432452.0, ans=0.125 2023-06-25 22:28:04,451 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.90 vs. limit=15.0 2023-06-25 22:28:12,602 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1432512.0, ans=0.2 2023-06-25 22:28:29,133 INFO [train.py:996] (0/4) Epoch 8, batch 25300, loss[loss=0.2456, simple_loss=0.3235, pruned_loss=0.08385, over 21248.00 frames. ], tot_loss[loss=0.208, simple_loss=0.2839, pruned_loss=0.06611, over 4271713.40 frames. ], batch size: 143, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:28:29,760 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1432572.0, ans=0.125 2023-06-25 22:29:53,837 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.880e+02 4.048e+02 5.397e+02 7.800e+02 1.560e+03, threshold=1.079e+03, percent-clipped=8.0 2023-06-25 22:30:06,388 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1432812.0, ans=0.1 2023-06-25 22:30:17,503 INFO [train.py:996] (0/4) Epoch 8, batch 25350, loss[loss=0.2115, simple_loss=0.2902, pruned_loss=0.06639, over 21611.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2868, pruned_loss=0.06604, over 4269866.87 frames. 
], batch size: 414, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:31:21,329 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1432992.0, ans=0.0 2023-06-25 22:31:22,776 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1432992.0, ans=0.1 2023-06-25 22:31:59,559 INFO [train.py:996] (0/4) Epoch 8, batch 25400, loss[loss=0.2124, simple_loss=0.2685, pruned_loss=0.07818, over 21498.00 frames. ], tot_loss[loss=0.206, simple_loss=0.2824, pruned_loss=0.06483, over 4267109.82 frames. ], batch size: 441, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:32:07,097 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1433172.0, ans=0.125 2023-06-25 22:32:33,379 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1433232.0, ans=0.0 2023-06-25 22:32:48,388 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1433292.0, ans=0.0 2023-06-25 22:33:18,942 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1433352.0, ans=0.0 2023-06-25 22:33:21,724 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.049e+02 4.073e+02 6.227e+02 9.020e+02 1.627e+03, threshold=1.245e+03, percent-clipped=13.0 2023-06-25 22:33:39,598 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1433412.0, ans=0.125 2023-06-25 22:33:45,985 INFO [train.py:996] (0/4) Epoch 8, batch 25450, loss[loss=0.2088, simple_loss=0.2875, pruned_loss=0.06503, over 21834.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2825, pruned_loss=0.06627, over 4264797.65 frames. ], batch size: 118, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:33:50,394 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1433472.0, ans=0.125 2023-06-25 22:35:14,294 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1433712.0, ans=0.05 2023-06-25 22:35:31,623 INFO [train.py:996] (0/4) Epoch 8, batch 25500, loss[loss=0.1929, simple_loss=0.2725, pruned_loss=0.05662, over 15789.00 frames. ], tot_loss[loss=0.2046, simple_loss=0.2826, pruned_loss=0.06335, over 4252250.96 frames. ], batch size: 62, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:35:45,317 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1433772.0, ans=0.125 2023-06-25 22:36:30,991 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 22:36:39,577 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 22:36:40,082 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.86 vs. 
limit=15.0 2023-06-25 22:36:41,740 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1433892.0, ans=0.125 2023-06-25 22:36:45,984 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.88 vs. limit=5.0 2023-06-25 22:36:56,880 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.763e+02 3.870e+02 4.829e+02 7.230e+02 1.638e+03, threshold=9.659e+02, percent-clipped=1.0 2023-06-25 22:37:21,628 INFO [train.py:996] (0/4) Epoch 8, batch 25550, loss[loss=0.2437, simple_loss=0.3423, pruned_loss=0.07252, over 21573.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2903, pruned_loss=0.06442, over 4251249.58 frames. ], batch size: 471, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:38:07,540 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1434132.0, ans=0.1 2023-06-25 22:38:25,181 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1434192.0, ans=0.1 2023-06-25 22:38:27,058 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1434192.0, ans=0.125 2023-06-25 22:38:46,061 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1434252.0, ans=0.1 2023-06-25 22:39:20,115 INFO [train.py:996] (0/4) Epoch 8, batch 25600, loss[loss=0.3031, simple_loss=0.3604, pruned_loss=0.1229, over 21477.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2954, pruned_loss=0.06598, over 4259916.36 frames. ], batch size: 471, lr: 3.65e-03, grad_scale: 32.0 2023-06-25 22:39:53,449 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1434432.0, ans=0.0 2023-06-25 22:40:03,342 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1434492.0, ans=0.07 2023-06-25 22:40:03,356 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1434492.0, ans=0.125 2023-06-25 22:40:10,398 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1434492.0, ans=0.125 2023-06-25 22:40:10,413 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1434492.0, ans=0.125 2023-06-25 22:40:31,983 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.098e+02 4.217e+02 6.682e+02 9.360e+02 1.950e+03, threshold=1.336e+03, percent-clipped=22.0 2023-06-25 22:40:34,176 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1434552.0, ans=0.125 2023-06-25 22:41:11,583 INFO [train.py:996] (0/4) Epoch 8, batch 25650, loss[loss=0.212, simple_loss=0.2708, pruned_loss=0.07654, over 21593.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2954, pruned_loss=0.06863, over 4261292.71 frames. 
], batch size: 415, lr: 3.65e-03, grad_scale: 32.0 2023-06-25 22:42:17,076 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1434852.0, ans=0.0 2023-06-25 22:42:58,588 INFO [train.py:996] (0/4) Epoch 8, batch 25700, loss[loss=0.2748, simple_loss=0.4041, pruned_loss=0.07274, over 19755.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2919, pruned_loss=0.0693, over 4255735.80 frames. ], batch size: 702, lr: 3.65e-03, grad_scale: 32.0 2023-06-25 22:43:32,147 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1435032.0, ans=0.125 2023-06-25 22:44:03,577 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1435152.0, ans=0.1 2023-06-25 22:44:05,623 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1435152.0, ans=0.125 2023-06-25 22:44:06,643 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.951e+02 3.970e+02 5.194e+02 7.142e+02 1.504e+03, threshold=1.039e+03, percent-clipped=1.0 2023-06-25 22:44:20,212 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1435212.0, ans=0.0 2023-06-25 22:44:29,456 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1435212.0, ans=0.125 2023-06-25 22:44:52,957 INFO [train.py:996] (0/4) Epoch 8, batch 25750, loss[loss=0.257, simple_loss=0.354, pruned_loss=0.08002, over 21721.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.2979, pruned_loss=0.07178, over 4263043.49 frames. ], batch size: 332, lr: 3.65e-03, grad_scale: 32.0 2023-06-25 22:45:01,451 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1435272.0, ans=0.1 2023-06-25 22:45:10,948 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1435332.0, ans=0.125 2023-06-25 22:45:12,433 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1435332.0, ans=0.0 2023-06-25 22:45:25,547 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1435332.0, ans=0.125 2023-06-25 22:45:25,663 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1435332.0, ans=0.1 2023-06-25 22:45:29,234 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1435392.0, ans=0.125 2023-06-25 22:45:35,010 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.91 vs. limit=6.0 2023-06-25 22:46:22,651 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1435452.0, ans=0.125 2023-06-25 22:46:44,471 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1435572.0, ans=0.2 2023-06-25 22:46:45,452 INFO [train.py:996] (0/4) Epoch 8, batch 25800, loss[loss=0.2761, simple_loss=0.3549, pruned_loss=0.09864, over 21389.00 frames. 
], tot_loss[loss=0.2318, simple_loss=0.3104, pruned_loss=0.0766, over 4260832.88 frames. ], batch size: 159, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:47:49,064 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1435692.0, ans=0.0 2023-06-25 22:48:11,402 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.235e+02 4.952e+02 6.520e+02 9.122e+02 2.118e+03, threshold=1.304e+03, percent-clipped=17.0 2023-06-25 22:48:13,702 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1435752.0, ans=0.125 2023-06-25 22:48:33,954 INFO [train.py:996] (0/4) Epoch 8, batch 25850, loss[loss=0.2144, simple_loss=0.2821, pruned_loss=0.0734, over 21550.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3115, pruned_loss=0.07623, over 4268109.44 frames. ], batch size: 211, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:48:48,873 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1435872.0, ans=0.035 2023-06-25 22:48:54,570 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1435932.0, ans=0.125 2023-06-25 22:49:05,491 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=14.95 vs. limit=15.0 2023-06-25 22:49:32,820 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1435992.0, ans=0.2 2023-06-25 22:50:05,197 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.36 vs. limit=15.0 2023-06-25 22:50:18,925 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1436112.0, ans=0.0 2023-06-25 22:50:22,313 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1436172.0, ans=0.04949747468305833 2023-06-25 22:50:23,342 INFO [train.py:996] (0/4) Epoch 8, batch 25900, loss[loss=0.2746, simple_loss=0.3674, pruned_loss=0.09095, over 21812.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3129, pruned_loss=0.07743, over 4274219.59 frames. ], batch size: 351, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:51:16,099 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1436292.0, ans=0.125 2023-06-25 22:51:43,656 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.598e+02 5.216e+02 8.298e+02 1.003e+03 1.891e+03, threshold=1.660e+03, percent-clipped=7.0 2023-06-25 22:52:05,584 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1436472.0, ans=0.2 2023-06-25 22:52:06,598 INFO [train.py:996] (0/4) Epoch 8, batch 25950, loss[loss=0.2494, simple_loss=0.324, pruned_loss=0.08738, over 21812.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3198, pruned_loss=0.08077, over 4272572.45 frames. 
], batch size: 282, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 22:52:29,413 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1436472.0, ans=0.125 2023-06-25 22:52:40,417 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.23 vs. limit=15.0 2023-06-25 22:53:26,045 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1436652.0, ans=0.0 2023-06-25 22:53:31,978 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.60 vs. limit=15.0 2023-06-25 22:53:58,751 INFO [train.py:996] (0/4) Epoch 8, batch 26000, loss[loss=0.24, simple_loss=0.3251, pruned_loss=0.07747, over 21687.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3184, pruned_loss=0.0781, over 4265989.05 frames. ], batch size: 351, lr: 3.64e-03, grad_scale: 32.0 2023-06-25 22:54:20,730 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1436772.0, ans=0.125 2023-06-25 22:54:44,880 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.36 vs. limit=15.0 2023-06-25 22:55:20,036 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.00 vs. limit=15.0 2023-06-25 22:55:20,391 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.951e+02 4.119e+02 5.246e+02 6.904e+02 1.299e+03, threshold=1.049e+03, percent-clipped=0.0 2023-06-25 22:55:27,731 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1437012.0, ans=0.1 2023-06-25 22:55:44,261 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.28 vs. limit=15.0 2023-06-25 22:55:46,758 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1437072.0, ans=0.125 2023-06-25 22:55:47,869 INFO [train.py:996] (0/4) Epoch 8, batch 26050, loss[loss=0.2515, simple_loss=0.3134, pruned_loss=0.09481, over 21810.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3187, pruned_loss=0.0802, over 4272062.67 frames. ], batch size: 441, lr: 3.64e-03, grad_scale: 32.0 2023-06-25 22:55:57,484 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1437072.0, ans=0.0 2023-06-25 22:55:57,569 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1437072.0, ans=0.1 2023-06-25 22:56:31,642 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1437192.0, ans=0.125 2023-06-25 22:56:39,030 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.86 vs. 
limit=15.0 2023-06-25 22:56:53,499 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1437252.0, ans=0.125 2023-06-25 22:57:09,205 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1437312.0, ans=0.125 2023-06-25 22:57:28,593 INFO [train.py:996] (0/4) Epoch 8, batch 26100, loss[loss=0.2132, simple_loss=0.2761, pruned_loss=0.07515, over 21350.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3117, pruned_loss=0.07923, over 4286521.31 frames. ], batch size: 176, lr: 3.64e-03, grad_scale: 32.0 2023-06-25 22:58:44,092 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.195e+02 4.438e+02 5.651e+02 7.112e+02 1.480e+03, threshold=1.130e+03, percent-clipped=4.0 2023-06-25 22:59:22,540 INFO [train.py:996] (0/4) Epoch 8, batch 26150, loss[loss=0.2495, simple_loss=0.3245, pruned_loss=0.08731, over 21237.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3098, pruned_loss=0.07975, over 4292900.12 frames. ], batch size: 143, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 22:59:23,207 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1437672.0, ans=0.1 2023-06-25 23:00:17,120 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.89 vs. limit=22.5 2023-06-25 23:00:21,980 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1437852.0, ans=0.0 2023-06-25 23:00:32,473 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1437852.0, ans=0.125 2023-06-25 23:00:48,062 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1437912.0, ans=0.2 2023-06-25 23:01:12,284 INFO [train.py:996] (0/4) Epoch 8, batch 26200, loss[loss=0.2179, simple_loss=0.3238, pruned_loss=0.05602, over 21658.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3102, pruned_loss=0.07699, over 4294862.55 frames. ], batch size: 414, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:01:42,157 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff2.min_abs, batch_count=1438032.0, ans=0.1 2023-06-25 23:01:50,802 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1438092.0, ans=0.1 2023-06-25 23:02:20,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1438152.0, ans=0.125 2023-06-25 23:02:23,567 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.227e+02 4.470e+02 5.888e+02 8.750e+02 1.495e+03, threshold=1.178e+03, percent-clipped=8.0 2023-06-25 23:02:55,433 INFO [train.py:996] (0/4) Epoch 8, batch 26250, loss[loss=0.2161, simple_loss=0.2946, pruned_loss=0.06885, over 21913.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3127, pruned_loss=0.07573, over 4297994.24 frames. 
], batch size: 351, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:02:59,624 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1438272.0, ans=0.125 2023-06-25 23:03:49,024 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1438392.0, ans=0.125 2023-06-25 23:04:28,459 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1438512.0, ans=0.125 2023-06-25 23:04:36,267 INFO [train.py:996] (0/4) Epoch 8, batch 26300, loss[loss=0.227, simple_loss=0.2931, pruned_loss=0.08051, over 21582.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3097, pruned_loss=0.07696, over 4303928.38 frames. ], batch size: 195, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:05:07,310 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.38 vs. limit=15.0 2023-06-25 23:05:10,580 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1438692.0, ans=0.125 2023-06-25 23:06:03,912 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.391e+02 4.218e+02 5.396e+02 7.440e+02 1.508e+03, threshold=1.079e+03, percent-clipped=2.0 2023-06-25 23:06:21,598 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 23:06:24,593 INFO [train.py:996] (0/4) Epoch 8, batch 26350, loss[loss=0.2555, simple_loss=0.3264, pruned_loss=0.09229, over 21400.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3082, pruned_loss=0.07731, over 4304716.87 frames. ], batch size: 548, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:07:11,255 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1438992.0, ans=0.07 2023-06-25 23:08:04,760 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1439112.0, ans=0.0 2023-06-25 23:08:11,325 INFO [train.py:996] (0/4) Epoch 8, batch 26400, loss[loss=0.1937, simple_loss=0.2604, pruned_loss=0.06345, over 21251.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3027, pruned_loss=0.07728, over 4284237.33 frames. ], batch size: 176, lr: 3.64e-03, grad_scale: 32.0 2023-06-25 23:09:36,046 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.120e+02 4.025e+02 5.044e+02 7.451e+02 1.741e+03, threshold=1.009e+03, percent-clipped=9.0 2023-06-25 23:09:57,686 INFO [train.py:996] (0/4) Epoch 8, batch 26450, loss[loss=0.224, simple_loss=0.3196, pruned_loss=0.06417, over 19853.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3015, pruned_loss=0.07605, over 4274971.11 frames. ], batch size: 707, lr: 3.64e-03, grad_scale: 32.0 2023-06-25 23:10:18,232 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1439472.0, ans=0.0 2023-06-25 23:10:18,794 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.12 vs. limit=6.0 2023-06-25 23:10:20,853 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.35 vs. 
limit=22.5 2023-06-25 23:10:32,361 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1439532.0, ans=0.125 2023-06-25 23:10:43,694 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_na.min_abs, batch_count=1439592.0, ans=0.02 2023-06-25 23:11:48,720 INFO [train.py:996] (0/4) Epoch 8, batch 26500, loss[loss=0.1683, simple_loss=0.2292, pruned_loss=0.05366, over 21274.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.301, pruned_loss=0.07441, over 4272772.06 frames. ], batch size: 143, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:12:19,512 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.52 vs. limit=10.0 2023-06-25 23:12:33,225 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1439832.0, ans=0.125 2023-06-25 23:13:17,920 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-240000.pt 2023-06-25 23:13:23,152 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.879e+02 4.514e+02 6.896e+02 1.400e+03 2.768e+03, threshold=1.379e+03, percent-clipped=34.0 2023-06-25 23:13:53,894 INFO [train.py:996] (0/4) Epoch 8, batch 26550, loss[loss=0.1868, simple_loss=0.2773, pruned_loss=0.04821, over 21720.00 frames. ], tot_loss[loss=0.221, simple_loss=0.2984, pruned_loss=0.07181, over 4261307.99 frames. ], batch size: 298, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:14:30,478 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1440132.0, ans=0.0 2023-06-25 23:14:32,045 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1440132.0, ans=0.125 2023-06-25 23:14:55,941 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1440252.0, ans=0.1 2023-06-25 23:15:18,884 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.77 vs. limit=12.0 2023-06-25 23:15:33,429 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1440312.0, ans=0.2 2023-06-25 23:15:33,457 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1440312.0, ans=0.125 2023-06-25 23:15:35,116 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1440312.0, ans=0.0 2023-06-25 23:15:47,316 INFO [train.py:996] (0/4) Epoch 8, batch 26600, loss[loss=0.2023, simple_loss=0.2717, pruned_loss=0.06639, over 21255.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2992, pruned_loss=0.06993, over 4262707.01 frames. ], batch size: 131, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:16:05,150 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1440432.0, ans=0.95 2023-06-25 23:16:32,474 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.05 vs. 
limit=10.0 2023-06-25 23:16:33,871 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1440492.0, ans=0.0 2023-06-25 23:16:38,621 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1440492.0, ans=0.125 2023-06-25 23:17:00,248 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.847e+02 4.407e+02 5.733e+02 8.512e+02 1.391e+03, threshold=1.147e+03, percent-clipped=1.0 2023-06-25 23:17:31,146 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1440612.0, ans=0.0 2023-06-25 23:17:35,791 INFO [train.py:996] (0/4) Epoch 8, batch 26650, loss[loss=0.1621, simple_loss=0.2356, pruned_loss=0.04427, over 21403.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2932, pruned_loss=0.06889, over 4263405.26 frames. ], batch size: 131, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:18:07,704 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.48 vs. limit=15.0 2023-06-25 23:18:08,855 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1440792.0, ans=0.125 2023-06-25 23:19:15,822 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.94 vs. limit=15.0 2023-06-25 23:19:18,177 INFO [train.py:996] (0/4) Epoch 8, batch 26700, loss[loss=0.2329, simple_loss=0.3119, pruned_loss=0.07701, over 21875.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2856, pruned_loss=0.06596, over 4266016.06 frames. ], batch size: 107, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:19:40,533 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.04 vs. limit=6.0 2023-06-25 23:19:59,630 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1441092.0, ans=0.5 2023-06-25 23:20:35,393 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.74 vs. limit=15.0 2023-06-25 23:20:37,585 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.642e+02 3.824e+02 5.567e+02 8.569e+02 1.745e+03, threshold=1.113e+03, percent-clipped=13.0 2023-06-25 23:20:45,332 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1441212.0, ans=0.125 2023-06-25 23:21:01,564 INFO [train.py:996] (0/4) Epoch 8, batch 26750, loss[loss=0.2593, simple_loss=0.3398, pruned_loss=0.08943, over 21406.00 frames. ], tot_loss[loss=0.208, simple_loss=0.2861, pruned_loss=0.06498, over 4275765.19 frames. ], batch size: 131, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:21:09,590 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=1441272.0, ans=10.0 2023-06-25 23:21:15,684 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.86 vs. 
limit=15.0 2023-06-25 23:21:33,744 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1441332.0, ans=0.2 2023-06-25 23:21:39,608 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.77 vs. limit=15.0 2023-06-25 23:22:32,667 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1441512.0, ans=0.125 2023-06-25 23:22:32,678 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1441512.0, ans=0.125 2023-06-25 23:22:46,503 INFO [train.py:996] (0/4) Epoch 8, batch 26800, loss[loss=0.2304, simple_loss=0.3057, pruned_loss=0.07755, over 21857.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2939, pruned_loss=0.06912, over 4275352.86 frames. ], batch size: 247, lr: 3.64e-03, grad_scale: 32.0 2023-06-25 23:22:47,431 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.76 vs. limit=15.0 2023-06-25 23:24:14,274 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.253e+02 4.422e+02 6.215e+02 9.798e+02 1.990e+03, threshold=1.243e+03, percent-clipped=9.0 2023-06-25 23:24:38,145 INFO [train.py:996] (0/4) Epoch 8, batch 26850, loss[loss=0.2023, simple_loss=0.2605, pruned_loss=0.07207, over 20682.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2956, pruned_loss=0.07146, over 4276137.34 frames. ], batch size: 607, lr: 3.64e-03, grad_scale: 32.0 2023-06-25 23:25:25,298 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1441992.0, ans=0.125 2023-06-25 23:25:41,926 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.33 vs. limit=12.0 2023-06-25 23:25:46,517 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1442052.0, ans=0.1 2023-06-25 23:26:15,640 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1442112.0, ans=0.125 2023-06-25 23:26:20,006 INFO [train.py:996] (0/4) Epoch 8, batch 26900, loss[loss=0.2225, simple_loss=0.325, pruned_loss=0.06002, over 19878.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.2894, pruned_loss=0.07094, over 4276542.97 frames. 
], batch size: 702, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:26:43,059 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1442232.0, ans=10.0 2023-06-25 23:27:19,769 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1442292.0, ans=0.0 2023-06-25 23:27:21,530 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1442352.0, ans=0.05 2023-06-25 23:27:39,363 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1442412.0, ans=0.125 2023-06-25 23:27:40,240 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.068e+02 3.925e+02 6.896e+02 1.001e+03 2.184e+03, threshold=1.379e+03, percent-clipped=14.0 2023-06-25 23:27:56,600 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1442412.0, ans=0.95 2023-06-25 23:28:00,045 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1442412.0, ans=0.0 2023-06-25 23:28:02,725 INFO [train.py:996] (0/4) Epoch 8, batch 26950, loss[loss=0.2022, simple_loss=0.2916, pruned_loss=0.05635, over 21567.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2872, pruned_loss=0.07046, over 4271368.13 frames. ], batch size: 230, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:28:03,458 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1442472.0, ans=0.1 2023-06-25 23:29:09,264 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.71 vs. limit=15.0 2023-06-25 23:29:28,934 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.61 vs. limit=12.0 2023-06-25 23:29:52,143 INFO [train.py:996] (0/4) Epoch 8, batch 27000, loss[loss=0.2259, simple_loss=0.3217, pruned_loss=0.06509, over 21611.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2884, pruned_loss=0.06904, over 4266121.47 frames. ], batch size: 442, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:29:52,145 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-25 23:30:10,466 INFO [train.py:1028] (0/4) Epoch 8, validation: loss=0.2506, simple_loss=0.341, pruned_loss=0.08006, over 1796401.00 frames. 2023-06-25 23:30:10,467 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23714MB 2023-06-25 23:30:11,219 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1442772.0, ans=0.2 2023-06-25 23:30:12,917 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1442772.0, ans=0.125 2023-06-25 23:30:20,485 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.07 vs. 
limit=15.0 2023-06-25 23:30:21,922 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1442772.0, ans=0.1 2023-06-25 23:31:31,577 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1443012.0, ans=0.1 2023-06-25 23:31:32,478 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.635e+02 4.043e+02 5.265e+02 7.888e+02 2.132e+03, threshold=1.053e+03, percent-clipped=7.0 2023-06-25 23:31:49,547 INFO [train.py:996] (0/4) Epoch 8, batch 27050, loss[loss=0.2504, simple_loss=0.3241, pruned_loss=0.08837, over 21758.00 frames. ], tot_loss[loss=0.212, simple_loss=0.2916, pruned_loss=0.06616, over 4269958.33 frames. ], batch size: 441, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:33:14,272 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff3.min_abs, batch_count=1443252.0, ans=0.2 2023-06-25 23:33:38,796 INFO [train.py:996] (0/4) Epoch 8, batch 27100, loss[loss=0.2515, simple_loss=0.3286, pruned_loss=0.08723, over 21568.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.294, pruned_loss=0.06657, over 4280276.87 frames. ], batch size: 471, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:34:10,596 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1443372.0, ans=0.0 2023-06-25 23:34:49,494 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.42 vs. limit=15.0 2023-06-25 23:34:55,497 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1443552.0, ans=0.0 2023-06-25 23:34:59,774 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.68 vs. limit=15.0 2023-06-25 23:35:11,015 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.112e+02 4.566e+02 6.448e+02 9.782e+02 2.509e+03, threshold=1.290e+03, percent-clipped=22.0 2023-06-25 23:35:20,722 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1443612.0, ans=0.125 2023-06-25 23:35:33,793 INFO [train.py:996] (0/4) Epoch 8, batch 27150, loss[loss=0.2787, simple_loss=0.3737, pruned_loss=0.09185, over 21672.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.3036, pruned_loss=0.0697, over 4283456.64 frames. 
], batch size: 414, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:36:09,612 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 23:36:21,900 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1443792.0, ans=0.0 2023-06-25 23:36:41,203 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1443852.0, ans=0.125 2023-06-25 23:36:42,902 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1443852.0, ans=0.0 2023-06-25 23:36:58,348 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=1443912.0, ans=0.05 2023-06-25 23:37:28,321 INFO [train.py:996] (0/4) Epoch 8, batch 27200, loss[loss=0.2514, simple_loss=0.3258, pruned_loss=0.08856, over 21390.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3131, pruned_loss=0.07317, over 4288602.21 frames. ], batch size: 211, lr: 3.64e-03, grad_scale: 32.0 2023-06-25 23:38:14,858 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 23:38:16,960 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.45 vs. limit=15.0 2023-06-25 23:38:32,386 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1444152.0, ans=0.125 2023-06-25 23:39:01,338 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.411e+02 4.854e+02 6.757e+02 9.648e+02 1.735e+03, threshold=1.351e+03, percent-clipped=9.0 2023-06-25 23:39:09,098 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1444212.0, ans=0.125 2023-06-25 23:39:09,106 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1444212.0, ans=0.125 2023-06-25 23:39:18,854 INFO [train.py:996] (0/4) Epoch 8, batch 27250, loss[loss=0.2611, simple_loss=0.3292, pruned_loss=0.09648, over 21586.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3167, pruned_loss=0.07673, over 4286771.51 frames. ], batch size: 415, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:39:47,976 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1444332.0, ans=0.1 2023-06-25 23:39:58,923 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1444392.0, ans=0.0 2023-06-25 23:40:20,343 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1444392.0, ans=0.125 2023-06-25 23:40:59,818 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 23:41:14,494 INFO [train.py:996] (0/4) Epoch 8, batch 27300, loss[loss=0.2271, simple_loss=0.3131, pruned_loss=0.07054, over 21680.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3187, pruned_loss=0.07788, over 4283940.02 frames. 
], batch size: 263, lr: 3.63e-03, grad_scale: 16.0 2023-06-25 23:42:33,408 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1444752.0, ans=0.07 2023-06-25 23:42:38,676 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1444752.0, ans=0.0 2023-06-25 23:42:43,066 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.204e+02 4.436e+02 5.757e+02 8.260e+02 1.524e+03, threshold=1.151e+03, percent-clipped=4.0 2023-06-25 23:43:03,213 INFO [train.py:996] (0/4) Epoch 8, batch 27350, loss[loss=0.2396, simple_loss=0.3238, pruned_loss=0.07774, over 21246.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.321, pruned_loss=0.07826, over 4271995.13 frames. ], batch size: 176, lr: 3.63e-03, grad_scale: 16.0 2023-06-25 23:43:08,035 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.58 vs. limit=22.5 2023-06-25 23:43:12,593 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1444872.0, ans=0.025 2023-06-25 23:43:52,866 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.66 vs. limit=15.0 2023-06-25 23:44:03,298 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1444992.0, ans=0.0 2023-06-25 23:44:50,147 INFO [train.py:996] (0/4) Epoch 8, batch 27400, loss[loss=0.2024, simple_loss=0.2654, pruned_loss=0.06966, over 21572.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3151, pruned_loss=0.07679, over 4267757.55 frames. ], batch size: 263, lr: 3.63e-03, grad_scale: 16.0 2023-06-25 23:45:09,155 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.18 vs. limit=22.5 2023-06-25 23:45:11,688 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1445232.0, ans=0.1 2023-06-25 23:45:15,174 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1445232.0, ans=0.125 2023-06-25 23:45:49,734 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1445292.0, ans=0.125 2023-06-25 23:46:14,724 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.145e+02 3.925e+02 4.930e+02 6.414e+02 1.207e+03, threshold=9.861e+02, percent-clipped=2.0 2023-06-25 23:46:33,501 INFO [train.py:996] (0/4) Epoch 8, batch 27450, loss[loss=0.2177, simple_loss=0.3076, pruned_loss=0.06393, over 21894.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3075, pruned_loss=0.07446, over 4269218.55 frames. ], batch size: 372, lr: 3.63e-03, grad_scale: 8.0 2023-06-25 23:46:51,429 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.91 vs. 
limit=15.0 2023-06-25 23:47:25,123 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1445592.0, ans=0.125 2023-06-25 23:47:39,936 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1445652.0, ans=0.2 2023-06-25 23:47:43,109 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1445652.0, ans=0.0 2023-06-25 23:47:55,834 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.86 vs. limit=22.5 2023-06-25 23:48:18,948 INFO [train.py:996] (0/4) Epoch 8, batch 27500, loss[loss=0.2189, simple_loss=0.2917, pruned_loss=0.07312, over 21250.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3058, pruned_loss=0.07487, over 4276078.24 frames. ], batch size: 143, lr: 3.63e-03, grad_scale: 8.0 2023-06-25 23:49:35,630 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.99 vs. limit=6.0 2023-06-25 23:49:42,675 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.981e+02 3.778e+02 4.835e+02 6.283e+02 1.305e+03, threshold=9.670e+02, percent-clipped=1.0 2023-06-25 23:50:01,315 INFO [train.py:996] (0/4) Epoch 8, batch 27550, loss[loss=0.1794, simple_loss=0.2548, pruned_loss=0.05205, over 21350.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.2996, pruned_loss=0.07169, over 4275822.59 frames. ], batch size: 211, lr: 3.63e-03, grad_scale: 8.0 2023-06-25 23:50:15,750 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1446072.0, ans=0.125 2023-06-25 23:51:49,479 INFO [train.py:996] (0/4) Epoch 8, batch 27600, loss[loss=0.2338, simple_loss=0.3051, pruned_loss=0.0813, over 20072.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2928, pruned_loss=0.07074, over 4276454.98 frames. ], batch size: 702, lr: 3.63e-03, grad_scale: 16.0 2023-06-25 23:51:50,222 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1446372.0, ans=0.1 2023-06-25 23:52:56,469 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 23:52:58,481 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1446552.0, ans=0.0 2023-06-25 23:53:16,411 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.997e+02 3.759e+02 4.592e+02 6.391e+02 1.970e+03, threshold=9.184e+02, percent-clipped=8.0 2023-06-25 23:53:17,554 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.04 vs. limit=12.0 2023-06-25 23:53:33,860 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1446672.0, ans=0.2 2023-06-25 23:53:34,792 INFO [train.py:996] (0/4) Epoch 8, batch 27650, loss[loss=0.2215, simple_loss=0.2888, pruned_loss=0.07705, over 21290.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2873, pruned_loss=0.0706, over 4274845.36 frames. 
], batch size: 176, lr: 3.63e-03, grad_scale: 16.0 2023-06-25 23:53:36,963 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1446672.0, ans=0.125 2023-06-25 23:54:15,641 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1446732.0, ans=0.125 2023-06-25 23:54:39,761 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1446792.0, ans=0.125 2023-06-25 23:55:21,982 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1446972.0, ans=0.125 2023-06-25 23:55:22,830 INFO [train.py:996] (0/4) Epoch 8, batch 27700, loss[loss=0.1899, simple_loss=0.2821, pruned_loss=0.04889, over 21784.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.29, pruned_loss=0.06974, over 4281036.16 frames. ], batch size: 332, lr: 3.63e-03, grad_scale: 16.0 2023-06-25 23:55:55,211 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1447032.0, ans=0.1 2023-06-25 23:56:55,202 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1447212.0, ans=0.0 2023-06-25 23:56:56,237 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.091e+02 3.950e+02 5.187e+02 7.067e+02 1.545e+03, threshold=1.037e+03, percent-clipped=11.0 2023-06-25 23:57:09,970 INFO [train.py:996] (0/4) Epoch 8, batch 27750, loss[loss=0.2015, simple_loss=0.2894, pruned_loss=0.05675, over 21493.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2914, pruned_loss=0.06848, over 4282993.96 frames. ], batch size: 211, lr: 3.63e-03, grad_scale: 16.0 2023-06-25 23:58:02,395 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.18 vs. limit=15.0 2023-06-25 23:58:36,232 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.11 vs. limit=15.0 2023-06-25 23:58:54,865 INFO [train.py:996] (0/4) Epoch 8, batch 27800, loss[loss=0.2299, simple_loss=0.3108, pruned_loss=0.07451, over 21801.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.29, pruned_loss=0.06912, over 4286349.51 frames. ], batch size: 112, lr: 3.63e-03, grad_scale: 16.0 2023-06-25 23:59:01,688 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1447572.0, ans=0.125 2023-06-26 00:00:23,976 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.743e+02 4.274e+02 5.854e+02 7.453e+02 1.495e+03, threshold=1.171e+03, percent-clipped=16.0 2023-06-26 00:00:42,965 INFO [train.py:996] (0/4) Epoch 8, batch 27850, loss[loss=0.2298, simple_loss=0.3232, pruned_loss=0.06814, over 21819.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2897, pruned_loss=0.07023, over 4287513.24 frames. 
], batch size: 332, lr: 3.63e-03, grad_scale: 16.0 2023-06-26 00:01:34,111 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1447992.0, ans=0.1 2023-06-26 00:01:44,543 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1447992.0, ans=0.0 2023-06-26 00:02:02,359 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1448052.0, ans=0.0 2023-06-26 00:02:05,964 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 00:02:22,011 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1448112.0, ans=0.125 2023-06-26 00:02:22,085 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1448112.0, ans=0.125 2023-06-26 00:02:39,166 INFO [train.py:996] (0/4) Epoch 8, batch 27900, loss[loss=0.2785, simple_loss=0.365, pruned_loss=0.09602, over 21514.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2995, pruned_loss=0.07207, over 4287003.28 frames. ], batch size: 471, lr: 3.63e-03, grad_scale: 16.0 2023-06-26 00:03:37,195 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1448292.0, ans=0.125 2023-06-26 00:04:15,997 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.747e+02 3.981e+02 4.843e+02 6.105e+02 1.501e+03, threshold=9.685e+02, percent-clipped=1.0 2023-06-26 00:04:22,026 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1448412.0, ans=0.125 2023-06-26 00:04:35,188 INFO [train.py:996] (0/4) Epoch 8, batch 27950, loss[loss=0.2027, simple_loss=0.295, pruned_loss=0.05517, over 21722.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2987, pruned_loss=0.06845, over 4276179.37 frames. ], batch size: 247, lr: 3.63e-03, grad_scale: 16.0 2023-06-26 00:06:22,322 INFO [train.py:996] (0/4) Epoch 8, batch 28000, loss[loss=0.265, simple_loss=0.3261, pruned_loss=0.102, over 21658.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2959, pruned_loss=0.06716, over 4271776.23 frames. ], batch size: 471, lr: 3.63e-03, grad_scale: 32.0 2023-06-26 00:06:22,929 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1448772.0, ans=0.125 2023-06-26 00:06:31,540 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1448772.0, ans=0.2 2023-06-26 00:07:25,534 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1448952.0, ans=0.04949747468305833 2023-06-26 00:07:58,626 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.068e+02 4.485e+02 6.487e+02 9.458e+02 1.758e+03, threshold=1.297e+03, percent-clipped=21.0 2023-06-26 00:08:10,949 INFO [train.py:996] (0/4) Epoch 8, batch 28050, loss[loss=0.1958, simple_loss=0.2729, pruned_loss=0.05932, over 21807.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2933, pruned_loss=0.06824, over 4273017.58 frames. 
], batch size: 282, lr: 3.63e-03, grad_scale: 16.0 2023-06-26 00:08:18,523 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1449072.0, ans=0.0 2023-06-26 00:09:01,611 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1449192.0, ans=0.125 2023-06-26 00:09:03,468 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1449192.0, ans=0.125 2023-06-26 00:09:22,561 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1449252.0, ans=0.125 2023-06-26 00:09:36,582 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1449252.0, ans=0.05 2023-06-26 00:09:44,938 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1449312.0, ans=0.0 2023-06-26 00:09:57,865 INFO [train.py:996] (0/4) Epoch 8, batch 28100, loss[loss=0.2017, simple_loss=0.2717, pruned_loss=0.06581, over 21778.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2929, pruned_loss=0.06882, over 4273350.87 frames. ], batch size: 118, lr: 3.63e-03, grad_scale: 16.0 2023-06-26 00:10:34,736 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.90 vs. limit=15.0 2023-06-26 00:10:47,919 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1449492.0, ans=0.1 2023-06-26 00:11:10,859 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1449552.0, ans=0.0 2023-06-26 00:11:20,276 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.36 vs. limit=15.0 2023-06-26 00:11:27,657 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.094e+02 4.530e+02 6.783e+02 9.812e+02 2.062e+03, threshold=1.357e+03, percent-clipped=16.0 2023-06-26 00:11:39,998 INFO [train.py:996] (0/4) Epoch 8, batch 28150, loss[loss=0.1629, simple_loss=0.2296, pruned_loss=0.04806, over 21459.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2864, pruned_loss=0.06854, over 4263232.71 frames. ], batch size: 212, lr: 3.63e-03, grad_scale: 16.0 2023-06-26 00:12:15,158 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1449732.0, ans=0.125 2023-06-26 00:12:26,941 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1449792.0, ans=0.1 2023-06-26 00:12:49,624 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.67 vs. 
limit=15.0 2023-06-26 00:12:54,026 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1449852.0, ans=0.0 2023-06-26 00:13:13,397 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1449912.0, ans=0.125 2023-06-26 00:13:16,882 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1449912.0, ans=0.0 2023-06-26 00:13:26,689 INFO [train.py:996] (0/4) Epoch 8, batch 28200, loss[loss=0.2408, simple_loss=0.3029, pruned_loss=0.0894, over 21439.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2857, pruned_loss=0.06971, over 4265052.19 frames. ], batch size: 194, lr: 3.63e-03, grad_scale: 16.0 2023-06-26 00:13:27,309 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1449972.0, ans=0.0 2023-06-26 00:13:28,019 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn1.whiten.whitening_limit, batch_count=1449972.0, ans=22.5 2023-06-26 00:13:59,354 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1450032.0, ans=0.1 2023-06-26 00:14:33,906 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.65 vs. limit=15.0 2023-06-26 00:14:45,843 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1450152.0, ans=0.0 2023-06-26 00:14:56,303 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=1450212.0, ans=10.0 2023-06-26 00:15:02,519 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.442e+02 4.547e+02 5.710e+02 8.432e+02 1.923e+03, threshold=1.142e+03, percent-clipped=7.0 2023-06-26 00:15:14,918 INFO [train.py:996] (0/4) Epoch 8, batch 28250, loss[loss=0.2353, simple_loss=0.2901, pruned_loss=0.09026, over 21507.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2892, pruned_loss=0.07221, over 4262812.55 frames. ], batch size: 441, lr: 3.63e-03, grad_scale: 16.0 2023-06-26 00:15:17,509 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1450272.0, ans=0.125 2023-06-26 00:15:46,330 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.73 vs. limit=22.5 2023-06-26 00:16:33,563 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.49 vs. limit=15.0 2023-06-26 00:16:59,503 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1450512.0, ans=0.0 2023-06-26 00:17:04,139 INFO [train.py:996] (0/4) Epoch 8, batch 28300, loss[loss=0.1823, simple_loss=0.2712, pruned_loss=0.04668, over 21586.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2866, pruned_loss=0.07058, over 4260154.28 frames. 
], batch size: 230, lr: 3.63e-03, grad_scale: 16.0 2023-06-26 00:17:25,305 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1450632.0, ans=0.125 2023-06-26 00:18:17,441 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1450752.0, ans=0.2 2023-06-26 00:18:39,090 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.826e+02 4.235e+02 6.929e+02 1.082e+03 2.013e+03, threshold=1.386e+03, percent-clipped=23.0 2023-06-26 00:18:44,960 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1450812.0, ans=0.0 2023-06-26 00:18:45,008 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1450812.0, ans=0.125 2023-06-26 00:18:48,416 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1450812.0, ans=0.2 2023-06-26 00:18:56,497 INFO [train.py:996] (0/4) Epoch 8, batch 28350, loss[loss=0.1847, simple_loss=0.3099, pruned_loss=0.02978, over 20813.00 frames. ], tot_loss[loss=0.2056, simple_loss=0.2823, pruned_loss=0.06448, over 4263290.17 frames. ], batch size: 607, lr: 3.63e-03, grad_scale: 16.0 2023-06-26 00:19:40,384 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1450932.0, ans=0.125 2023-06-26 00:19:40,472 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1450932.0, ans=0.0 2023-06-26 00:19:57,961 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.09 vs. limit=12.0 2023-06-26 00:20:43,848 INFO [train.py:996] (0/4) Epoch 8, batch 28400, loss[loss=0.2279, simple_loss=0.3014, pruned_loss=0.07724, over 21762.00 frames. ], tot_loss[loss=0.2041, simple_loss=0.2786, pruned_loss=0.06479, over 4264738.63 frames. ], batch size: 118, lr: 3.63e-03, grad_scale: 32.0 2023-06-26 00:20:56,910 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1451172.0, ans=0.125 2023-06-26 00:21:30,141 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1451232.0, ans=0.0 2023-06-26 00:21:39,067 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1451292.0, ans=0.125 2023-06-26 00:21:51,417 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.90 vs. limit=15.0 2023-06-26 00:22:11,355 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1451412.0, ans=0.1 2023-06-26 00:22:20,864 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.356e+02 4.435e+02 6.673e+02 8.870e+02 1.776e+03, threshold=1.335e+03, percent-clipped=3.0 2023-06-26 00:22:31,535 INFO [train.py:996] (0/4) Epoch 8, batch 28450, loss[loss=0.2368, simple_loss=0.3008, pruned_loss=0.08644, over 21329.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2848, pruned_loss=0.06849, over 4265215.81 frames. 
], batch size: 159, lr: 3.63e-03, grad_scale: 16.0 2023-06-26 00:22:56,298 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 00:23:14,217 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1451532.0, ans=0.125 2023-06-26 00:24:04,953 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1451712.0, ans=0.125 2023-06-26 00:24:12,284 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1451712.0, ans=0.1 2023-06-26 00:24:30,338 INFO [train.py:996] (0/4) Epoch 8, batch 28500, loss[loss=0.1953, simple_loss=0.2657, pruned_loss=0.06241, over 21937.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2877, pruned_loss=0.07088, over 4275376.86 frames. ], batch size: 351, lr: 3.63e-03, grad_scale: 16.0 2023-06-26 00:25:04,575 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1451832.0, ans=0.125 2023-06-26 00:25:06,268 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1451832.0, ans=0.0 2023-06-26 00:25:58,854 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1452012.0, ans=0.125 2023-06-26 00:26:09,288 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.427e+02 4.818e+02 6.676e+02 8.470e+02 2.134e+03, threshold=1.335e+03, percent-clipped=3.0 2023-06-26 00:26:19,577 INFO [train.py:996] (0/4) Epoch 8, batch 28550, loss[loss=0.2477, simple_loss=0.3542, pruned_loss=0.07056, over 21695.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.2952, pruned_loss=0.07314, over 4276089.15 frames. ], batch size: 298, lr: 3.63e-03, grad_scale: 16.0 2023-06-26 00:26:33,041 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1452072.0, ans=0.2 2023-06-26 00:26:53,069 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.16 vs. limit=22.5 2023-06-26 00:27:23,279 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.25 vs. limit=6.0 2023-06-26 00:27:50,995 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1452312.0, ans=0.1 2023-06-26 00:28:15,669 INFO [train.py:996] (0/4) Epoch 8, batch 28600, loss[loss=0.2895, simple_loss=0.3468, pruned_loss=0.1161, over 21402.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.3031, pruned_loss=0.07556, over 4276401.46 frames. 
], batch size: 471, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 00:29:08,193 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1452492.0, ans=0.125 2023-06-26 00:29:50,516 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 00:29:53,320 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.158e+02 4.451e+02 5.957e+02 7.529e+02 1.462e+03, threshold=1.191e+03, percent-clipped=3.0 2023-06-26 00:29:55,947 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1452612.0, ans=0.125 2023-06-26 00:30:03,853 INFO [train.py:996] (0/4) Epoch 8, batch 28650, loss[loss=0.1952, simple_loss=0.2596, pruned_loss=0.06538, over 21111.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.2971, pruned_loss=0.07435, over 4268810.30 frames. ], batch size: 176, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 00:30:42,959 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1452792.0, ans=0.09899494936611666 2023-06-26 00:31:03,921 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1452792.0, ans=0.125 2023-06-26 00:31:38,392 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.15 vs. limit=12.0 2023-06-26 00:31:47,837 INFO [train.py:996] (0/4) Epoch 8, batch 28700, loss[loss=0.1859, simple_loss=0.2535, pruned_loss=0.05913, over 21278.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.2941, pruned_loss=0.07441, over 4272316.84 frames. ], batch size: 549, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 00:31:50,244 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1452972.0, ans=0.0 2023-06-26 00:33:19,800 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.264e+02 4.611e+02 5.755e+02 7.778e+02 1.501e+03, threshold=1.151e+03, percent-clipped=4.0 2023-06-26 00:33:30,636 INFO [train.py:996] (0/4) Epoch 8, batch 28750, loss[loss=0.2013, simple_loss=0.2857, pruned_loss=0.05848, over 21419.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.2951, pruned_loss=0.07527, over 4274341.84 frames. ], batch size: 131, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 00:34:59,134 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1453452.0, ans=0.125 2023-06-26 00:35:06,041 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1453512.0, ans=0.125 2023-06-26 00:35:07,598 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1453512.0, ans=0.1 2023-06-26 00:35:09,817 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.29 vs. limit=15.0 2023-06-26 00:35:18,781 INFO [train.py:996] (0/4) Epoch 8, batch 28800, loss[loss=0.2433, simple_loss=0.3149, pruned_loss=0.08589, over 21326.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.2991, pruned_loss=0.07565, over 4271502.79 frames. 
], batch size: 548, lr: 3.62e-03, grad_scale: 32.0 2023-06-26 00:35:21,455 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1453572.0, ans=0.0 2023-06-26 00:35:28,238 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1453572.0, ans=0.0 2023-06-26 00:36:28,409 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1453692.0, ans=0.025 2023-06-26 00:36:55,664 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.079e+02 4.504e+02 5.803e+02 7.798e+02 1.715e+03, threshold=1.161e+03, percent-clipped=9.0 2023-06-26 00:37:06,156 INFO [train.py:996] (0/4) Epoch 8, batch 28850, loss[loss=0.203, simple_loss=0.266, pruned_loss=0.06998, over 20992.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3015, pruned_loss=0.07718, over 4279832.53 frames. ], batch size: 607, lr: 3.62e-03, grad_scale: 32.0 2023-06-26 00:38:13,919 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 00:38:40,898 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1454112.0, ans=0.07 2023-06-26 00:39:02,758 INFO [train.py:996] (0/4) Epoch 8, batch 28900, loss[loss=0.2393, simple_loss=0.3091, pruned_loss=0.08471, over 21372.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3035, pruned_loss=0.07871, over 4284188.17 frames. ], batch size: 548, lr: 3.62e-03, grad_scale: 32.0 2023-06-26 00:39:10,906 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1454172.0, ans=0.07 2023-06-26 00:39:18,097 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1454172.0, ans=0.125 2023-06-26 00:39:42,045 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1454232.0, ans=0.125 2023-06-26 00:40:15,716 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1454352.0, ans=0.95 2023-06-26 00:40:18,049 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.42 vs. limit=15.0 2023-06-26 00:40:36,719 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.510e+02 4.525e+02 6.150e+02 8.317e+02 2.231e+03, threshold=1.230e+03, percent-clipped=10.0 2023-06-26 00:40:57,567 INFO [train.py:996] (0/4) Epoch 8, batch 28950, loss[loss=0.2127, simple_loss=0.3134, pruned_loss=0.05602, over 21832.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3034, pruned_loss=0.07737, over 4281679.84 frames. ], batch size: 371, lr: 3.62e-03, grad_scale: 32.0 2023-06-26 00:41:15,245 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=2.99 vs. 
limit=10.0 2023-06-26 00:41:47,823 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1454592.0, ans=0.125 2023-06-26 00:42:01,683 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1454652.0, ans=0.07 2023-06-26 00:42:39,343 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1454712.0, ans=0.125 2023-06-26 00:42:46,438 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1454772.0, ans=0.125 2023-06-26 00:42:52,301 INFO [train.py:996] (0/4) Epoch 8, batch 29000, loss[loss=0.2352, simple_loss=0.313, pruned_loss=0.07871, over 21820.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3065, pruned_loss=0.07608, over 4274806.45 frames. ], batch size: 247, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 00:43:16,645 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.84 vs. limit=15.0 2023-06-26 00:43:39,359 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.09 vs. limit=22.5 2023-06-26 00:43:59,074 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1454952.0, ans=0.125 2023-06-26 00:44:25,392 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.229e+02 4.694e+02 5.564e+02 8.456e+02 2.061e+03, threshold=1.113e+03, percent-clipped=6.0 2023-06-26 00:44:38,441 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1455072.0, ans=10.0 2023-06-26 00:44:38,553 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1455072.0, ans=0.125 2023-06-26 00:44:39,530 INFO [train.py:996] (0/4) Epoch 8, batch 29050, loss[loss=0.2285, simple_loss=0.2988, pruned_loss=0.0791, over 21835.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3052, pruned_loss=0.07715, over 4284806.82 frames. ], batch size: 441, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 00:44:57,835 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1455072.0, ans=0.1 2023-06-26 00:45:28,981 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1455192.0, ans=0.2 2023-06-26 00:46:27,351 INFO [train.py:996] (0/4) Epoch 8, batch 29100, loss[loss=0.1857, simple_loss=0.2556, pruned_loss=0.05791, over 21763.00 frames. ], tot_loss[loss=0.223, simple_loss=0.2965, pruned_loss=0.07475, over 4286836.03 frames. ], batch size: 112, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 00:48:06,923 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.913e+02 4.309e+02 6.273e+02 8.461e+02 1.678e+03, threshold=1.255e+03, percent-clipped=7.0 2023-06-26 00:48:11,057 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1455612.0, ans=0.0 2023-06-26 00:48:15,342 INFO [train.py:996] (0/4) Epoch 8, batch 29150, loss[loss=0.2274, simple_loss=0.3147, pruned_loss=0.07006, over 21398.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2957, pruned_loss=0.07304, over 4271571.31 frames. 
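], batch size: 194, lr: 3.62e-03, grad_scale: 16.0

The per-batch and running loss figures in the [train.py:996] records are consistent with the printed loss being a weighted sum of the two transducer terms, with a 0.5 weight on the simple loss. That reading is inferred from the numbers in the log itself, not quoted from the training code; a quick Python check against the "Epoch 8, batch 29150" record just above:

# Hedged check: assumes printed loss = 0.5 * simple_loss + pruned_loss,
# which the figures in this log are consistent with.
simple_loss, pruned_loss = 0.3147, 0.07006   # per-batch values, batch 29150
tot_simple, tot_pruned = 0.2957, 0.07304     # running (tot_loss) values
print(round(0.5 * simple_loss + pruned_loss, 4))  # 0.2274, as logged
print(round(0.5 * tot_simple + tot_pruned, 4))    # 0.2209, as logged
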
2023-06-26 00:49:06,465 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1455792.0, ans=0.125 2023-06-26 00:50:08,240 INFO [train.py:996] (0/4) Epoch 8, batch 29200, loss[loss=0.1854, simple_loss=0.2516, pruned_loss=0.05958, over 21746.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2915, pruned_loss=0.07247, over 4266446.08 frames. ], batch size: 283, lr: 3.62e-03, grad_scale: 32.0 2023-06-26 00:50:30,511 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.06 vs. limit=15.0 2023-06-26 00:51:30,467 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1456152.0, ans=0.2 2023-06-26 00:51:37,503 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1456212.0, ans=0.125 2023-06-26 00:51:42,014 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.239e+02 4.282e+02 5.514e+02 8.024e+02 1.461e+03, threshold=1.103e+03, percent-clipped=3.0 2023-06-26 00:51:56,528 INFO [train.py:996] (0/4) Epoch 8, batch 29250, loss[loss=0.1698, simple_loss=0.2363, pruned_loss=0.05172, over 17377.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2909, pruned_loss=0.07066, over 4262609.25 frames. ], batch size: 67, lr: 3.62e-03, grad_scale: 32.0 2023-06-26 00:51:59,215 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1456272.0, ans=0.125 2023-06-26 00:52:27,413 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1456332.0, ans=0.04949747468305833 2023-06-26 00:53:24,183 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1456512.0, ans=0.0 2023-06-26 00:53:24,238 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1456512.0, ans=0.125 2023-06-26 00:53:44,030 INFO [train.py:996] (0/4) Epoch 8, batch 29300, loss[loss=0.2377, simple_loss=0.294, pruned_loss=0.09073, over 21290.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2922, pruned_loss=0.06941, over 4270754.75 frames.
], batch size: 471, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 00:53:58,607 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 00:54:16,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1456632.0, ans=0.125 2023-06-26 00:54:24,743 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1456692.0, ans=0.2 2023-06-26 00:54:35,074 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1456692.0, ans=0.1 2023-06-26 00:54:44,032 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1456692.0, ans=0.125 2023-06-26 00:54:45,951 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1456752.0, ans=0.07 2023-06-26 00:54:53,416 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.14 vs. limit=6.0 2023-06-26 00:55:25,849 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.817e+02 4.100e+02 5.558e+02 8.472e+02 2.092e+03, threshold=1.112e+03, percent-clipped=11.0 2023-06-26 00:55:32,619 INFO [train.py:996] (0/4) Epoch 8, batch 29350, loss[loss=0.2125, simple_loss=0.301, pruned_loss=0.06199, over 21743.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2901, pruned_loss=0.06904, over 4267239.37 frames. ], batch size: 333, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 00:55:35,214 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1456872.0, ans=0.0 2023-06-26 00:56:00,287 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.90 vs. limit=15.0 2023-06-26 00:57:18,430 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1457112.0, ans=0.125 2023-06-26 00:57:21,102 INFO [train.py:996] (0/4) Epoch 8, batch 29400, loss[loss=0.2029, simple_loss=0.3059, pruned_loss=0.04995, over 20767.00 frames. ], tot_loss[loss=0.212, simple_loss=0.2897, pruned_loss=0.06713, over 4259569.99 frames. ], batch size: 608, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 00:57:44,356 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.69 vs. limit=22.5 2023-06-26 00:58:01,639 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1457232.0, ans=0.125 2023-06-26 00:58:23,678 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.00 vs. limit=15.0 2023-06-26 00:58:26,219 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1457292.0, ans=0.125 2023-06-26 00:58:37,787 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.18 vs. 
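limit=15.0

The [scaling.py:962] Whitening records, like the conv_module1.whiten entry just above (metric=8.18 vs. limit=15.0), report a whitening metric for a module's activations against a limit. A common way to express such a metric is the spread of the eigenvalues of the feature covariance: it equals 1.0 when the activations are perfectly whitened and grows as a few directions dominate. The sketch below is illustrative only, under that assumption, and is not taken from scaling.py:

import torch

def whitening_metric(x: torch.Tensor) -> torch.Tensor:
    # x: (num_frames, num_channels) activations for one group.
    # Ratio of the mean squared eigenvalue of the covariance to the squared
    # mean eigenvalue: 1.0 when all eigenvalues are equal (fully whitened).
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.t() @ x) / x.shape[0]
    num_channels = cov.shape[0]
    return num_channels * (cov @ cov).diagonal().sum() / cov.diagonal().sum() ** 2

Comparing such a metric against a per-module limit, as these records do, leaves well-conditioned activations alone and flags only the ones whose covariance has become strongly anisotropic.
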
2023-06-26 00:59:02,200 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.050e+02 4.516e+02 7.158e+02 1.067e+03 2.108e+03, threshold=1.432e+03, percent-clipped=22.0 2023-06-26 00:59:09,197 INFO [train.py:996] (0/4) Epoch 8, batch 29450, loss[loss=0.1962, simple_loss=0.2911, pruned_loss=0.05066, over 20768.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2878, pruned_loss=0.06614, over 4265775.46 frames. ], batch size: 607, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 01:00:10,416 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1457592.0, ans=0.125 2023-06-26 01:00:36,299 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1457652.0, ans=0.125 2023-06-26 01:00:43,196 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1457712.0, ans=0.0 2023-06-26 01:00:51,861 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1457712.0, ans=0.1 2023-06-26 01:00:56,321 INFO [train.py:996] (0/4) Epoch 8, batch 29500, loss[loss=0.2259, simple_loss=0.3026, pruned_loss=0.07458, over 21388.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2928, pruned_loss=0.06971, over 4274151.25 frames. ], batch size: 131, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 01:01:12,214 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1457772.0, ans=0.1 2023-06-26 01:01:12,237 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1457772.0, ans=0.125 2023-06-26 01:01:48,116 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1457892.0, ans=0.125 2023-06-26 01:01:55,204 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1457892.0, ans=0.125 2023-06-26 01:02:36,145 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.293e+02 4.544e+02 5.932e+02 7.825e+02 1.489e+03, threshold=1.186e+03, percent-clipped=1.0 2023-06-26 01:02:42,877 INFO [train.py:996] (0/4) Epoch 8, batch 29550, loss[loss=0.2431, simple_loss=0.3187, pruned_loss=0.08378, over 21811.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2922, pruned_loss=0.07121, over 4283198.79 frames.
], batch size: 112, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 01:02:53,974 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff3.min_abs, batch_count=1458072.0, ans=0.2 2023-06-26 01:03:15,836 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1458132.0, ans=0.0 2023-06-26 01:03:18,240 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn1.whiten.whitening_limit, batch_count=1458132.0, ans=22.5 2023-06-26 01:04:15,699 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1458312.0, ans=0.125 2023-06-26 01:04:19,966 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1458312.0, ans=0.1 2023-06-26 01:04:23,266 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1458312.0, ans=0.09899494936611666 2023-06-26 01:04:40,147 INFO [train.py:996] (0/4) Epoch 8, batch 29600, loss[loss=0.2834, simple_loss=0.3792, pruned_loss=0.09377, over 21275.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2981, pruned_loss=0.07348, over 4287375.19 frames. ], batch size: 548, lr: 3.62e-03, grad_scale: 32.0 2023-06-26 01:05:44,060 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1458492.0, ans=0.0 2023-06-26 01:05:49,403 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1458552.0, ans=0.125 2023-06-26 01:06:08,461 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1458612.0, ans=0.125 2023-06-26 01:06:21,144 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.719e+02 4.529e+02 7.554e+02 1.096e+03 2.697e+03, threshold=1.511e+03, percent-clipped=19.0 2023-06-26 01:06:27,949 INFO [train.py:996] (0/4) Epoch 8, batch 29650, loss[loss=0.1951, simple_loss=0.2785, pruned_loss=0.0559, over 21645.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.296, pruned_loss=0.07058, over 4278921.95 frames. ], batch size: 441, lr: 3.62e-03, grad_scale: 32.0 2023-06-26 01:06:50,011 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1458732.0, ans=0.1 2023-06-26 01:07:07,509 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1458732.0, ans=0.025 2023-06-26 01:07:36,719 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1458852.0, ans=0.125 2023-06-26 01:07:36,725 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1458852.0, ans=0.2 2023-06-26 01:07:39,937 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1458852.0, ans=0.125 2023-06-26 01:07:41,708 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1458852.0, ans=0.125 2023-06-26 01:08:17,131 INFO [train.py:996] (0/4) Epoch 8, batch 29700, loss[loss=0.2301, simple_loss=0.3372, pruned_loss=0.06147, over 21642.00 frames. 
], tot_loss[loss=0.2186, simple_loss=0.2967, pruned_loss=0.07021, over 4282851.28 frames. ], batch size: 263, lr: 3.62e-03, grad_scale: 32.0 2023-06-26 01:08:17,863 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1458972.0, ans=0.125 2023-06-26 01:08:22,131 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.50 vs. limit=15.0 2023-06-26 01:08:50,482 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1459032.0, ans=0.1 2023-06-26 01:09:24,413 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.04 vs. limit=6.0 2023-06-26 01:09:57,653 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.231e+02 4.516e+02 5.860e+02 9.248e+02 1.775e+03, threshold=1.172e+03, percent-clipped=6.0 2023-06-26 01:10:04,580 INFO [train.py:996] (0/4) Epoch 8, batch 29750, loss[loss=0.1963, simple_loss=0.2762, pruned_loss=0.0582, over 21446.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.3016, pruned_loss=0.07039, over 4277363.79 frames. ], batch size: 131, lr: 3.62e-03, grad_scale: 32.0 2023-06-26 01:10:10,528 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1459272.0, ans=0.125 2023-06-26 01:11:15,436 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1459452.0, ans=0.0 2023-06-26 01:11:39,443 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1459512.0, ans=0.125 2023-06-26 01:11:51,253 INFO [train.py:996] (0/4) Epoch 8, batch 29800, loss[loss=0.2061, simple_loss=0.2847, pruned_loss=0.06378, over 21528.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.3033, pruned_loss=0.07124, over 4286188.77 frames. ], batch size: 194, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 01:11:53,629 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1459572.0, ans=0.125 2023-06-26 01:12:18,669 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1459632.0, ans=0.125 2023-06-26 01:12:54,216 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1459692.0, ans=0.125 2023-06-26 01:12:55,705 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 01:13:24,632 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1459812.0, ans=0.125 2023-06-26 01:13:32,345 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.767e+02 3.928e+02 4.572e+02 6.290e+02 1.025e+03, threshold=9.144e+02, percent-clipped=0.0 2023-06-26 01:13:37,436 INFO [train.py:996] (0/4) Epoch 8, batch 29850, loss[loss=0.203, simple_loss=0.2787, pruned_loss=0.06359, over 21871.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2982, pruned_loss=0.06965, over 4285475.71 frames. 
], batch size: 316, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 01:14:13,139 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.59 vs. limit=15.0 2023-06-26 01:14:20,031 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.11 vs. limit=22.5 2023-06-26 01:14:43,851 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.09 vs. limit=15.0 2023-06-26 01:14:50,580 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.52 vs. limit=15.0 2023-06-26 01:15:20,097 INFO [train.py:996] (0/4) Epoch 8, batch 29900, loss[loss=0.2312, simple_loss=0.3044, pruned_loss=0.07897, over 21329.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2963, pruned_loss=0.07021, over 4290253.94 frames. ], batch size: 176, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 01:15:45,304 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1460172.0, ans=0.125 2023-06-26 01:15:48,762 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1460232.0, ans=0.125 2023-06-26 01:16:18,174 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1460292.0, ans=0.0 2023-06-26 01:17:10,425 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.335e+02 4.671e+02 6.480e+02 9.712e+02 1.710e+03, threshold=1.296e+03, percent-clipped=28.0 2023-06-26 01:17:15,539 INFO [train.py:996] (0/4) Epoch 8, batch 29950, loss[loss=0.209, simple_loss=0.2704, pruned_loss=0.07377, over 20263.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.2993, pruned_loss=0.07321, over 4277366.36 frames. ], batch size: 707, lr: 3.61e-03, grad_scale: 16.0 2023-06-26 01:17:34,199 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.76 vs. limit=15.0 2023-06-26 01:17:35,336 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1460532.0, ans=0.125 2023-06-26 01:18:13,063 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.62 vs. limit=12.0 2023-06-26 01:18:23,713 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.49 vs. limit=22.5 2023-06-26 01:18:46,862 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1460712.0, ans=10.0 2023-06-26 01:19:00,379 INFO [train.py:996] (0/4) Epoch 8, batch 30000, loss[loss=0.2027, simple_loss=0.3033, pruned_loss=0.05105, over 21646.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3017, pruned_loss=0.07356, over 4285214.46 frames. 
], batch size: 389, lr: 3.61e-03, grad_scale: 32.0 2023-06-26 01:19:00,381 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-26 01:19:12,618 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.5.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([1.6752, 3.8542, 3.8680, 3.3420], device='cuda:0') 2023-06-26 01:19:18,797 INFO [train.py:1028] (0/4) Epoch 8, validation: loss=0.2464, simple_loss=0.3452, pruned_loss=0.07378, over 1796401.00 frames. 2023-06-26 01:19:18,799 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23714MB 2023-06-26 01:21:04,475 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1461012.0, ans=0.0 2023-06-26 01:21:10,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1461012.0, ans=10.0 2023-06-26 01:21:14,674 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.858e+02 4.174e+02 5.657e+02 7.922e+02 1.669e+03, threshold=1.131e+03, percent-clipped=1.0 2023-06-26 01:21:20,175 INFO [train.py:996] (0/4) Epoch 8, batch 30050, loss[loss=0.2591, simple_loss=0.374, pruned_loss=0.07215, over 21636.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.3053, pruned_loss=0.07069, over 4275853.06 frames. ], batch size: 414, lr: 3.61e-03, grad_scale: 32.0 2023-06-26 01:21:31,835 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.96 vs. limit=15.0 2023-06-26 01:21:35,125 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1461072.0, ans=0.1 2023-06-26 01:21:45,436 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1461132.0, ans=0.125 2023-06-26 01:22:42,035 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1461252.0, ans=0.2 2023-06-26 01:22:52,054 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 01:22:52,142 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1461312.0, ans=0.0 2023-06-26 01:23:13,428 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.72 vs. limit=15.0 2023-06-26 01:23:13,786 INFO [train.py:996] (0/4) Epoch 8, batch 30100, loss[loss=0.2155, simple_loss=0.2751, pruned_loss=0.0779, over 21367.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.3051, pruned_loss=0.07052, over 4269002.46 frames. ], batch size: 144, lr: 3.61e-03, grad_scale: 16.0 2023-06-26 01:23:50,504 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1461432.0, ans=0.2 2023-06-26 01:24:53,945 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.090e+02 4.517e+02 6.270e+02 9.720e+02 3.054e+03, threshold=1.254e+03, percent-clipped=16.0 2023-06-26 01:24:57,518 INFO [train.py:996] (0/4) Epoch 8, batch 30150, loss[loss=0.2262, simple_loss=0.2947, pruned_loss=0.07886, over 21659.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.3005, pruned_loss=0.07207, over 4268527.94 frames. 
], batch size: 351, lr: 3.61e-03, grad_scale: 16.0 2023-06-26 01:25:39,315 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 01:26:13,475 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1461852.0, ans=0.125 2023-06-26 01:26:33,113 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1461912.0, ans=0.0 2023-06-26 01:26:53,770 INFO [train.py:996] (0/4) Epoch 8, batch 30200, loss[loss=0.2236, simple_loss=0.3131, pruned_loss=0.06706, over 21645.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.3036, pruned_loss=0.07158, over 4270498.74 frames. ], batch size: 263, lr: 3.61e-03, grad_scale: 16.0 2023-06-26 01:26:56,306 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1461972.0, ans=0.125 2023-06-26 01:27:12,856 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1461972.0, ans=0.0 2023-06-26 01:27:37,931 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.65 vs. limit=22.5 2023-06-26 01:27:41,060 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1462092.0, ans=0.5 2023-06-26 01:28:02,404 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1462152.0, ans=0.125 2023-06-26 01:28:02,465 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1462152.0, ans=0.125 2023-06-26 01:28:45,643 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.354e+02 5.048e+02 7.227e+02 1.023e+03 2.150e+03, threshold=1.445e+03, percent-clipped=15.0 2023-06-26 01:28:48,925 INFO [train.py:996] (0/4) Epoch 8, batch 30250, loss[loss=0.298, simple_loss=0.3813, pruned_loss=0.1073, over 21508.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3107, pruned_loss=0.07393, over 4274205.99 frames. ], batch size: 471, lr: 3.61e-03, grad_scale: 16.0 2023-06-26 01:29:21,929 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1462332.0, ans=0.07 2023-06-26 01:29:38,761 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1462392.0, ans=0.125 2023-06-26 01:29:59,883 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=6.55 vs. limit=22.5 2023-06-26 01:30:03,472 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.91 vs. limit=10.0 2023-06-26 01:30:21,984 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1462512.0, ans=0.125 2023-06-26 01:30:29,013 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1462512.0, ans=0.125 2023-06-26 01:30:36,887 INFO [train.py:996] (0/4) Epoch 8, batch 30300, loss[loss=0.1965, simple_loss=0.2615, pruned_loss=0.06571, over 21519.00 frames. 
], tot_loss[loss=0.2285, simple_loss=0.3087, pruned_loss=0.07412, over 4266972.77 frames. ], batch size: 414, lr: 3.61e-03, grad_scale: 16.0 2023-06-26 01:30:39,491 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1462572.0, ans=0.125 2023-06-26 01:30:52,649 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1462572.0, ans=0.0 2023-06-26 01:31:33,766 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1462692.0, ans=0.0 2023-06-26 01:32:31,192 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.196e+02 5.174e+02 6.761e+02 1.021e+03 2.632e+03, threshold=1.352e+03, percent-clipped=10.0 2023-06-26 01:32:34,766 INFO [train.py:996] (0/4) Epoch 8, batch 30350, loss[loss=0.3333, simple_loss=0.4134, pruned_loss=0.1266, over 21469.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3076, pruned_loss=0.07523, over 4262519.04 frames. ], batch size: 471, lr: 3.61e-03, grad_scale: 16.0 2023-06-26 01:32:41,322 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.74 vs. limit=10.0 2023-06-26 01:33:15,479 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1462992.0, ans=0.2 2023-06-26 01:33:20,094 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1462992.0, ans=0.125 2023-06-26 01:33:27,520 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1463052.0, ans=0.125 2023-06-26 01:33:34,957 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1463052.0, ans=0.125 2023-06-26 01:33:40,977 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1463112.0, ans=0.125 2023-06-26 01:33:56,345 INFO [train.py:996] (0/4) Epoch 8, batch 30400, loss[loss=0.2081, simple_loss=0.2553, pruned_loss=0.08043, over 20263.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.3023, pruned_loss=0.07368, over 4249428.05 frames. ], batch size: 703, lr: 3.61e-03, grad_scale: 32.0 2023-06-26 01:34:12,399 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1463172.0, ans=0.0 2023-06-26 01:34:24,587 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1463232.0, ans=0.0 2023-06-26 01:34:31,507 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.25 vs. limit=15.0 2023-06-26 01:35:24,305 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.064e+02 6.383e+02 1.075e+03 1.632e+03 7.193e+03, threshold=2.149e+03, percent-clipped=36.0 2023-06-26 01:35:25,757 INFO [train.py:996] (0/4) Epoch 8, batch 30450, loss[loss=0.2764, simple_loss=0.3854, pruned_loss=0.08371, over 19869.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.3026, pruned_loss=0.07332, over 4192924.22 frames. 
], batch size: 702, lr: 3.61e-03, grad_scale: 16.0 2023-06-26 01:35:48,196 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1463532.0, ans=0.125 2023-06-26 01:36:17,580 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1463652.0, ans=0.2 2023-06-26 01:36:39,716 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/epoch-8.pt 2023-06-26 01:38:50,977 INFO [train.py:996] (0/4) Epoch 9, batch 0, loss[loss=0.211, simple_loss=0.279, pruned_loss=0.07153, over 21545.00 frames. ], tot_loss[loss=0.211, simple_loss=0.279, pruned_loss=0.07153, over 21545.00 frames. ], batch size: 391, lr: 3.39e-03, grad_scale: 32.0 2023-06-26 01:38:50,979 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-26 01:39:14,231 INFO [train.py:1028] (0/4) Epoch 9, validation: loss=0.2395, simple_loss=0.3459, pruned_loss=0.06656, over 1796401.00 frames. 2023-06-26 01:39:14,232 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23714MB 2023-06-26 01:40:00,293 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1463862.0, ans=0.0 2023-06-26 01:40:28,988 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1463922.0, ans=0.125 2023-06-26 01:40:37,705 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-244000.pt 2023-06-26 01:40:59,310 INFO [train.py:996] (0/4) Epoch 9, batch 50, loss[loss=0.1906, simple_loss=0.2691, pruned_loss=0.05606, over 21215.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.299, pruned_loss=0.07219, over 954721.06 frames. ], batch size: 159, lr: 3.39e-03, grad_scale: 16.0 2023-06-26 01:41:03,128 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1464042.0, ans=0.125 2023-06-26 01:41:13,451 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.197e+02 4.855e+02 1.072e+03 2.293e+03 5.497e+03, threshold=2.144e+03, percent-clipped=28.0 2023-06-26 01:41:33,275 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1464102.0, ans=0.05 2023-06-26 01:41:57,191 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1464162.0, ans=0.0 2023-06-26 01:42:40,949 INFO [train.py:996] (0/4) Epoch 9, batch 100, loss[loss=0.2546, simple_loss=0.3438, pruned_loss=0.08272, over 19888.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3176, pruned_loss=0.074, over 1682070.16 frames. 
], batch size: 702, lr: 3.39e-03, grad_scale: 16.0 2023-06-26 01:43:01,247 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1464342.0, ans=0.2 2023-06-26 01:43:35,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1464462.0, ans=0.125 2023-06-26 01:43:46,335 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1464462.0, ans=0.0 2023-06-26 01:44:15,074 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1464582.0, ans=0.1 2023-06-26 01:44:26,128 INFO [train.py:996] (0/4) Epoch 9, batch 150, loss[loss=0.189, simple_loss=0.2786, pruned_loss=0.04967, over 21368.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3195, pruned_loss=0.07345, over 2263264.68 frames. ], batch size: 194, lr: 3.39e-03, grad_scale: 16.0 2023-06-26 01:44:40,672 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.204e+02 4.415e+02 5.834e+02 7.944e+02 1.480e+03, threshold=1.167e+03, percent-clipped=0.0 2023-06-26 01:44:58,429 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.64 vs. limit=10.0 2023-06-26 01:45:14,323 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1464702.0, ans=0.125 2023-06-26 01:45:28,397 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1464762.0, ans=0.0 2023-06-26 01:45:51,986 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1464822.0, ans=0.125 2023-06-26 01:45:58,442 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1464882.0, ans=0.125 2023-06-26 01:46:02,358 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1464882.0, ans=0.0 2023-06-26 01:46:13,225 INFO [train.py:996] (0/4) Epoch 9, batch 200, loss[loss=0.2081, simple_loss=0.2927, pruned_loss=0.06175, over 21760.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3181, pruned_loss=0.07297, over 2712224.63 frames. ], batch size: 282, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 01:47:21,426 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1465062.0, ans=0.5 2023-06-26 01:48:00,454 INFO [train.py:996] (0/4) Epoch 9, batch 250, loss[loss=0.2042, simple_loss=0.3008, pruned_loss=0.05376, over 21676.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3139, pruned_loss=0.07322, over 3059759.87 frames. 
], batch size: 247, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 01:48:08,787 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.143e+02 4.378e+02 6.069e+02 8.721e+02 1.562e+03, threshold=1.214e+03, percent-clipped=10.0 2023-06-26 01:48:13,354 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1465242.0, ans=0.125 2023-06-26 01:48:24,064 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1465302.0, ans=0.2 2023-06-26 01:49:01,862 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1465362.0, ans=0.125 2023-06-26 01:49:07,256 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1465362.0, ans=0.125 2023-06-26 01:49:10,887 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1465362.0, ans=0.1 2023-06-26 01:49:23,837 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.26 vs. limit=15.0 2023-06-26 01:49:37,112 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1465482.0, ans=0.125 2023-06-26 01:49:43,683 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1465482.0, ans=0.125 2023-06-26 01:49:45,578 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1465482.0, ans=0.125 2023-06-26 01:49:49,357 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1465542.0, ans=0.2 2023-06-26 01:49:50,367 INFO [train.py:996] (0/4) Epoch 9, batch 300, loss[loss=0.2167, simple_loss=0.3222, pruned_loss=0.05557, over 21757.00 frames. ], tot_loss[loss=0.225, simple_loss=0.3054, pruned_loss=0.07223, over 3326626.36 frames. ], batch size: 351, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 01:50:03,183 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1465542.0, ans=0.125 2023-06-26 01:50:22,397 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.57 vs. limit=15.0 2023-06-26 01:50:23,343 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1465602.0, ans=0.0 2023-06-26 01:50:32,052 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 01:51:38,512 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1465782.0, ans=0.125 2023-06-26 01:51:41,281 INFO [train.py:996] (0/4) Epoch 9, batch 350, loss[loss=0.1988, simple_loss=0.2603, pruned_loss=0.06864, over 21577.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.3005, pruned_loss=0.0706, over 3533457.58 frames. 
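], batch size: 298, lr: 3.38e-03, grad_scale: 16.0

In the [optim.py:471] records throughout this log, the five "grad-norm quartiles" values read like the min / 25% / 50% / 75% / max of recently observed gradient norms, and the printed threshold consistently equals Clipping_scale times the logged median (for the 01:48:08 entry above, 2.0 * 6.069e+02 is about 1.214e+03, the logged threshold). The snippet below only reproduces that bookkeeping from a buffer of norms; it is a hedged sketch, not the optimizer code behind these lines:

import torch

def clipping_report(recent_grad_norms: torch.Tensor, clipping_scale: float = 2.0):
    # recent_grad_norms: 1-D float tensor of gradient norms from recent batches.
    quartiles = torch.quantile(recent_grad_norms,
                               torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
    threshold = clipping_scale * quartiles[2]          # scale times the median
    percent_clipped = 100.0 * (recent_grad_norms > threshold).float().mean()
    return quartiles, threshold, percent_clipped

Here percent_clipped is simply the share of buffered norms above the threshold; whether the training code computes its percent-clipped figure exactly this way is not something this log confirms.
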
2023-06-26 01:51:50,486 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.975e+02 4.637e+02 6.282e+02 9.202e+02 1.945e+03, threshold=1.256e+03, percent-clipped=12.0 2023-06-26 01:52:10,643 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1465902.0, ans=0.1 2023-06-26 01:52:10,742 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1465902.0, ans=0.125 2023-06-26 01:52:58,014 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1466022.0, ans=0.1 2023-06-26 01:53:23,265 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1466082.0, ans=0.0 2023-06-26 01:53:30,987 INFO [train.py:996] (0/4) Epoch 9, batch 400, loss[loss=0.2422, simple_loss=0.2978, pruned_loss=0.09328, over 21370.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2949, pruned_loss=0.0697, over 3694619.39 frames. ], batch size: 473, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 01:54:01,813 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1466142.0, ans=0.125 2023-06-26 01:54:25,000 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1466202.0, ans=0.125 2023-06-26 01:54:37,489 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1466262.0, ans=0.125 2023-06-26 01:54:43,960 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1466322.0, ans=0.5 2023-06-26 01:54:47,744 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1466322.0, ans=0.125 2023-06-26 01:54:59,859 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 01:55:20,930 INFO [train.py:996] (0/4) Epoch 9, batch 450, loss[loss=0.1984, simple_loss=0.2469, pruned_loss=0.07493, over 20225.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2947, pruned_loss=0.06837, over 3826090.48 frames.
], batch size: 703, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 01:55:21,594 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1466442.0, ans=0.1 2023-06-26 01:55:23,348 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1466442.0, ans=0.05 2023-06-26 01:55:25,031 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1466442.0, ans=0.2 2023-06-26 01:55:41,295 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.313e+02 4.889e+02 7.953e+02 1.170e+03 2.853e+03, threshold=1.591e+03, percent-clipped=21.0 2023-06-26 01:56:10,088 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1466502.0, ans=0.0 2023-06-26 01:56:43,792 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1466622.0, ans=0.1 2023-06-26 01:56:59,449 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1466682.0, ans=0.0 2023-06-26 01:57:14,000 INFO [train.py:996] (0/4) Epoch 9, batch 500, loss[loss=0.1768, simple_loss=0.2651, pruned_loss=0.04424, over 21279.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2935, pruned_loss=0.06732, over 3934302.38 frames. ], batch size: 176, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 01:58:03,785 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1466862.0, ans=0.125 2023-06-26 01:58:27,977 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1466922.0, ans=0.1 2023-06-26 01:58:44,846 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.98 vs. limit=15.0 2023-06-26 01:58:49,766 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1466982.0, ans=0.125 2023-06-26 01:59:08,339 INFO [train.py:996] (0/4) Epoch 9, batch 550, loss[loss=0.2002, simple_loss=0.2845, pruned_loss=0.05795, over 21533.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2968, pruned_loss=0.06638, over 4009287.92 frames. ], batch size: 230, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 01:59:25,226 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.991e+02 4.595e+02 7.824e+02 1.104e+03 2.417e+03, threshold=1.565e+03, percent-clipped=11.0 2023-06-26 02:00:30,113 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.39 vs. limit=15.0 2023-06-26 02:00:53,336 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1467282.0, ans=0.125 2023-06-26 02:01:03,296 INFO [train.py:996] (0/4) Epoch 9, batch 600, loss[loss=0.2306, simple_loss=0.3277, pruned_loss=0.06673, over 21429.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.3016, pruned_loss=0.06766, over 4073619.83 frames. 
], batch size: 211, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:01:21,142 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1467402.0, ans=0.0 2023-06-26 02:01:28,320 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 02:02:15,919 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1467522.0, ans=0.0 2023-06-26 02:02:25,001 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1467582.0, ans=0.125 2023-06-26 02:02:47,026 INFO [train.py:996] (0/4) Epoch 9, batch 650, loss[loss=0.1941, simple_loss=0.2826, pruned_loss=0.05275, over 21730.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.3016, pruned_loss=0.06834, over 4115711.11 frames. ], batch size: 282, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:03:01,072 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1467642.0, ans=0.0 2023-06-26 02:03:03,589 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.173e+02 5.371e+02 7.433e+02 1.361e+03 3.228e+03, threshold=1.487e+03, percent-clipped=18.0 2023-06-26 02:03:28,698 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1467702.0, ans=0.125 2023-06-26 02:03:32,339 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1467702.0, ans=0.0 2023-06-26 02:03:41,992 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.60 vs. limit=15.0 2023-06-26 02:04:03,890 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1467822.0, ans=0.0 2023-06-26 02:04:13,938 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1467882.0, ans=0.05 2023-06-26 02:04:44,098 INFO [train.py:996] (0/4) Epoch 9, batch 700, loss[loss=0.1903, simple_loss=0.2561, pruned_loss=0.06225, over 21213.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2998, pruned_loss=0.06902, over 4158621.47 frames. ], batch size: 143, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:05:07,717 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1468002.0, ans=0.125 2023-06-26 02:05:30,740 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.68 vs. limit=22.5 2023-06-26 02:05:34,965 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1468062.0, ans=0.1 2023-06-26 02:05:55,914 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1468122.0, ans=0.125 2023-06-26 02:05:59,175 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1468182.0, ans=0.125 2023-06-26 02:06:31,665 INFO [train.py:996] (0/4) Epoch 9, batch 750, loss[loss=0.2369, simple_loss=0.3616, pruned_loss=0.05605, over 19803.00 frames. 
], tot_loss[loss=0.219, simple_loss=0.2994, pruned_loss=0.06929, over 4166889.00 frames. ], batch size: 703, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:06:42,121 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.730e+02 4.754e+02 6.417e+02 9.585e+02 1.882e+03, threshold=1.283e+03, percent-clipped=6.0 2023-06-26 02:07:29,729 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1468422.0, ans=0.125 2023-06-26 02:07:40,577 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1468422.0, ans=0.2 2023-06-26 02:07:43,881 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1468422.0, ans=0.125 2023-06-26 02:08:09,153 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1468542.0, ans=0.035 2023-06-26 02:08:10,193 INFO [train.py:996] (0/4) Epoch 9, batch 800, loss[loss=0.2287, simple_loss=0.3054, pruned_loss=0.07596, over 21544.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2966, pruned_loss=0.06959, over 4195931.33 frames. ], batch size: 441, lr: 3.38e-03, grad_scale: 32.0 2023-06-26 02:08:43,445 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1468602.0, ans=0.2 2023-06-26 02:09:11,836 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1468662.0, ans=0.0 2023-06-26 02:10:07,719 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 02:10:09,766 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1468842.0, ans=0.125 2023-06-26 02:10:10,648 INFO [train.py:996] (0/4) Epoch 9, batch 850, loss[loss=0.2077, simple_loss=0.2813, pruned_loss=0.06707, over 21187.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.296, pruned_loss=0.06946, over 4219949.50 frames. ], batch size: 176, lr: 3.38e-03, grad_scale: 32.0 2023-06-26 02:10:18,330 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1468842.0, ans=0.0 2023-06-26 02:10:26,297 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.492e+02 5.225e+02 7.900e+02 1.161e+03 2.208e+03, threshold=1.580e+03, percent-clipped=19.0 2023-06-26 02:11:01,313 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1468962.0, ans=0.125 2023-06-26 02:11:06,320 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1468962.0, ans=0.125 2023-06-26 02:11:08,574 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.50 vs. limit=15.0 2023-06-26 02:11:59,368 INFO [train.py:996] (0/4) Epoch 9, batch 900, loss[loss=0.2141, simple_loss=0.2814, pruned_loss=0.07334, over 21303.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2931, pruned_loss=0.06832, over 4233768.56 frames. 
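], batch size: 176, lr: 3.38e-03, grad_scale: 16.0

The [scaling.py:182] ScheduledFloat records (dropout_p, skip_rate, balancer prob, scale_min, and so on) each print a value ("ans") for the current batch_count, i.e. these hyperparameters are schedules over training progress rather than constants. A minimal illustrative schedule keyed on batch_count is sketched below; the breakpoints and values are invented for the example and are not the ones behind the records above:

from typing import List, Tuple

def scheduled_float(batch_count: float, schedule: List[Tuple[float, float]]) -> float:
    # Piecewise-linear interpolation between (batch_count, value) breakpoints,
    # clamped to the first/last value outside the covered range.
    x0, y0 = schedule[0]
    if batch_count <= x0:
        return y0
    for x1, y1 in schedule[1:]:
        if batch_count <= x1:
            return y0 + (batch_count - x0) / (x1 - x0) * (y1 - y0)
        x0, y0 = x1, y1
    return y0

# e.g. a dropout probability decaying from 0.3 to 0.1 over the first 20000 batches,
# evaluated at a batch_count in the range shown above
print(scheduled_float(1468962.0, [(0.0, 0.3), (20000.0, 0.1)]))  # 0.1
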
2023-06-26 02:12:50,663 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1469262.0, ans=0.0 2023-06-26 02:13:30,174 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1469382.0, ans=0.125 2023-06-26 02:13:36,936 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 02:13:48,837 INFO [train.py:996] (0/4) Epoch 9, batch 950, loss[loss=0.1979, simple_loss=0.2603, pruned_loss=0.06771, over 21760.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2916, pruned_loss=0.06762, over 4248838.09 frames. ], batch size: 300, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:13:55,759 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.51 vs. limit=15.0 2023-06-26 02:14:01,419 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.026e+02 4.404e+02 7.084e+02 1.100e+03 2.197e+03, threshold=1.417e+03, percent-clipped=5.0 2023-06-26 02:14:12,612 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 02:14:12,698 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1469502.0, ans=0.1 2023-06-26 02:14:26,361 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1469562.0, ans=0.125 2023-06-26 02:14:42,904 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1469622.0, ans=0.125 2023-06-26 02:14:45,303 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.49 vs. limit=12.0 2023-06-26 02:15:36,815 INFO [train.py:996] (0/4) Epoch 9, batch 1000, loss[loss=0.2146, simple_loss=0.2987, pruned_loss=0.06525, over 21609.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2893, pruned_loss=0.06698, over 4260450.61 frames.
], batch size: 441, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:15:46,335 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1469742.0, ans=0.125 2023-06-26 02:16:05,668 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 02:16:09,601 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1469802.0, ans=0.2 2023-06-26 02:16:14,723 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1469862.0, ans=0.125 2023-06-26 02:16:20,210 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1469862.0, ans=0.0 2023-06-26 02:17:01,317 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1469922.0, ans=0.0 2023-06-26 02:17:04,716 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1469982.0, ans=0.2 2023-06-26 02:17:09,086 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.09 vs. limit=6.0 2023-06-26 02:17:27,499 INFO [train.py:996] (0/4) Epoch 9, batch 1050, loss[loss=0.2275, simple_loss=0.2962, pruned_loss=0.07938, over 21817.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.29, pruned_loss=0.06758, over 4270062.69 frames. ], batch size: 441, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:17:39,396 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.206e+02 4.347e+02 6.082e+02 9.446e+02 2.534e+03, threshold=1.216e+03, percent-clipped=8.0 2023-06-26 02:17:45,788 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1470102.0, ans=0.0 2023-06-26 02:19:18,805 INFO [train.py:996] (0/4) Epoch 9, batch 1100, loss[loss=0.2111, simple_loss=0.2572, pruned_loss=0.08256, over 20147.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2904, pruned_loss=0.06771, over 4263032.14 frames. ], batch size: 702, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:20:35,315 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.94 vs. limit=6.0 2023-06-26 02:20:53,997 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1470582.0, ans=0.125 2023-06-26 02:21:09,481 INFO [train.py:996] (0/4) Epoch 9, batch 1150, loss[loss=0.2164, simple_loss=0.2873, pruned_loss=0.07277, over 21880.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2903, pruned_loss=0.06761, over 4267950.59 frames. 
], batch size: 124, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:21:22,269 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.381e+02 4.817e+02 6.167e+02 1.033e+03 2.052e+03, threshold=1.233e+03, percent-clipped=13.0 2023-06-26 02:21:24,754 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1470642.0, ans=0.125 2023-06-26 02:21:28,115 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1470702.0, ans=0.0 2023-06-26 02:21:52,166 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1470762.0, ans=0.125 2023-06-26 02:22:50,383 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1470882.0, ans=0.125 2023-06-26 02:23:00,183 INFO [train.py:996] (0/4) Epoch 9, batch 1200, loss[loss=0.1741, simple_loss=0.2618, pruned_loss=0.04319, over 21503.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2926, pruned_loss=0.06876, over 4277614.36 frames. ], batch size: 212, lr: 3.38e-03, grad_scale: 32.0 2023-06-26 02:23:43,322 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.83 vs. limit=12.0 2023-06-26 02:23:43,373 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.52 vs. limit=15.0 2023-06-26 02:24:31,257 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1471122.0, ans=0.125 2023-06-26 02:24:52,854 INFO [train.py:996] (0/4) Epoch 9, batch 1250, loss[loss=0.2697, simple_loss=0.3583, pruned_loss=0.09058, over 21720.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2958, pruned_loss=0.06969, over 4286411.05 frames. ], batch size: 351, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:25:06,493 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.265e+02 4.578e+02 6.578e+02 9.426e+02 2.383e+03, threshold=1.316e+03, percent-clipped=14.0 2023-06-26 02:25:27,459 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1471302.0, ans=0.2 2023-06-26 02:26:21,557 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.46 vs. limit=15.0 2023-06-26 02:26:24,937 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1471482.0, ans=0.1 2023-06-26 02:26:31,367 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1471482.0, ans=0.1 2023-06-26 02:26:31,399 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1471482.0, ans=0.0 2023-06-26 02:26:40,211 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1471482.0, ans=0.125 2023-06-26 02:26:43,223 INFO [train.py:996] (0/4) Epoch 9, batch 1300, loss[loss=0.2055, simple_loss=0.3069, pruned_loss=0.05208, over 19870.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2946, pruned_loss=0.06974, over 4278134.96 frames. 
], batch size: 703, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:26:47,954 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.23 vs. limit=6.0 2023-06-26 02:26:50,902 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1471542.0, ans=0.125 2023-06-26 02:27:06,883 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1471602.0, ans=0.125 2023-06-26 02:28:12,894 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1471722.0, ans=0.125 2023-06-26 02:28:32,868 INFO [train.py:996] (0/4) Epoch 9, batch 1350, loss[loss=0.2832, simple_loss=0.3432, pruned_loss=0.1116, over 21425.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2972, pruned_loss=0.07082, over 4283164.79 frames. ], batch size: 509, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:28:43,771 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1471842.0, ans=0.0 2023-06-26 02:28:46,549 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.596e+02 4.887e+02 7.409e+02 1.206e+03 1.964e+03, threshold=1.482e+03, percent-clipped=19.0 2023-06-26 02:29:53,752 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1472022.0, ans=0.5 2023-06-26 02:30:22,931 INFO [train.py:996] (0/4) Epoch 9, batch 1400, loss[loss=0.2187, simple_loss=0.2886, pruned_loss=0.07437, over 21916.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2967, pruned_loss=0.07105, over 4281403.61 frames. ], batch size: 351, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:30:29,191 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1472142.0, ans=0.125 2023-06-26 02:30:37,624 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1472142.0, ans=0.04949747468305833 2023-06-26 02:31:38,741 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1472322.0, ans=0.125 2023-06-26 02:31:47,794 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1472322.0, ans=0.2 2023-06-26 02:32:01,623 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1472382.0, ans=0.0 2023-06-26 02:32:04,863 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1472382.0, ans=0.2 2023-06-26 02:32:13,543 INFO [train.py:996] (0/4) Epoch 9, batch 1450, loss[loss=0.1816, simple_loss=0.2507, pruned_loss=0.05622, over 21688.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2969, pruned_loss=0.07188, over 4278441.30 frames. 
], batch size: 333, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:32:27,126 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.492e+02 5.469e+02 8.336e+02 1.169e+03 2.052e+03, threshold=1.667e+03, percent-clipped=11.0 2023-06-26 02:32:58,927 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1472562.0, ans=0.125 2023-06-26 02:33:01,140 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.01 vs. limit=22.5 2023-06-26 02:33:15,658 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1472562.0, ans=0.125 2023-06-26 02:33:25,365 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.02 vs. limit=15.0 2023-06-26 02:33:34,825 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1472622.0, ans=0.1 2023-06-26 02:33:36,335 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1472622.0, ans=0.2 2023-06-26 02:33:57,833 INFO [train.py:996] (0/4) Epoch 9, batch 1500, loss[loss=0.1971, simple_loss=0.2611, pruned_loss=0.06651, over 21111.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.2967, pruned_loss=0.07246, over 4283064.50 frames. ], batch size: 143, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:34:14,623 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.73 vs. limit=22.5 2023-06-26 02:35:21,911 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1472922.0, ans=0.125 2023-06-26 02:35:44,375 INFO [train.py:996] (0/4) Epoch 9, batch 1550, loss[loss=0.1426, simple_loss=0.2155, pruned_loss=0.03481, over 16578.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2947, pruned_loss=0.07147, over 4282224.82 frames. ], batch size: 61, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:35:55,130 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.37 vs. limit=10.0 2023-06-26 02:35:58,924 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.109e+02 4.360e+02 5.874e+02 7.765e+02 1.799e+03, threshold=1.175e+03, percent-clipped=2.0 2023-06-26 02:37:11,406 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1473222.0, ans=0.125 2023-06-26 02:37:35,447 INFO [train.py:996] (0/4) Epoch 9, batch 1600, loss[loss=0.1382, simple_loss=0.1859, pruned_loss=0.04522, over 16393.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2952, pruned_loss=0.07114, over 4278784.57 frames. ], batch size: 60, lr: 3.38e-03, grad_scale: 32.0 2023-06-26 02:37:36,294 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1473342.0, ans=0.125 2023-06-26 02:38:25,569 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.96 vs. 
limit=22.5 2023-06-26 02:38:26,616 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1473402.0, ans=0.125 2023-06-26 02:38:48,348 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1473522.0, ans=0.125 2023-06-26 02:39:07,873 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1473582.0, ans=0.125 2023-06-26 02:39:22,833 INFO [train.py:996] (0/4) Epoch 9, batch 1650, loss[loss=0.199, simple_loss=0.2736, pruned_loss=0.0622, over 21198.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2934, pruned_loss=0.06991, over 4278867.96 frames. ], batch size: 143, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 02:39:45,513 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 02:39:56,131 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.260e+02 4.603e+02 6.235e+02 9.034e+02 1.719e+03, threshold=1.247e+03, percent-clipped=11.0 2023-06-26 02:39:59,999 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1473702.0, ans=0.125 2023-06-26 02:41:03,432 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.73 vs. limit=15.0 2023-06-26 02:41:11,351 INFO [train.py:996] (0/4) Epoch 9, batch 1700, loss[loss=0.2613, simple_loss=0.3168, pruned_loss=0.1029, over 21358.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2973, pruned_loss=0.07142, over 4279872.02 frames. ], batch size: 507, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 02:41:14,375 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.29 vs. limit=10.0 2023-06-26 02:42:29,130 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1474122.0, ans=0.2 2023-06-26 02:42:34,567 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1474122.0, ans=0.125 2023-06-26 02:42:45,715 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1474182.0, ans=0.0 2023-06-26 02:43:10,699 INFO [train.py:996] (0/4) Epoch 9, batch 1750, loss[loss=0.2215, simple_loss=0.3164, pruned_loss=0.06333, over 19866.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2987, pruned_loss=0.07042, over 4272663.12 frames. ], batch size: 702, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 02:43:21,137 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.18 vs. limit=15.0 2023-06-26 02:43:25,662 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1474242.0, ans=0.0 2023-06-26 02:43:26,234 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.83 vs. limit=15.0 2023-06-26 02:43:26,348 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.20 vs. 
limit=6.0 2023-06-26 02:43:26,590 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.127e+02 4.603e+02 7.165e+02 1.089e+03 2.171e+03, threshold=1.433e+03, percent-clipped=16.0 2023-06-26 02:43:42,655 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.11 vs. limit=10.0 2023-06-26 02:44:01,298 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1474362.0, ans=0.0 2023-06-26 02:44:12,123 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1474422.0, ans=0.125 2023-06-26 02:44:58,132 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1474542.0, ans=0.125 2023-06-26 02:44:59,064 INFO [train.py:996] (0/4) Epoch 9, batch 1800, loss[loss=0.2125, simple_loss=0.3189, pruned_loss=0.05301, over 21746.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2959, pruned_loss=0.06729, over 4273221.16 frames. ], batch size: 332, lr: 3.37e-03, grad_scale: 8.0 2023-06-26 02:46:30,773 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1474782.0, ans=0.125 2023-06-26 02:46:36,919 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.60 vs. limit=6.0 2023-06-26 02:46:41,085 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1474782.0, ans=0.125 2023-06-26 02:46:49,580 INFO [train.py:996] (0/4) Epoch 9, batch 1850, loss[loss=0.2082, simple_loss=0.2937, pruned_loss=0.06138, over 21846.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.296, pruned_loss=0.0653, over 4269656.34 frames. ], batch size: 282, lr: 3.37e-03, grad_scale: 8.0 2023-06-26 02:46:55,270 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1474842.0, ans=0.125 2023-06-26 02:47:01,326 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.91 vs. limit=6.0 2023-06-26 02:47:07,112 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.277e+02 4.370e+02 7.147e+02 9.387e+02 1.947e+03, threshold=1.429e+03, percent-clipped=4.0 2023-06-26 02:47:18,909 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1474902.0, ans=0.125 2023-06-26 02:47:18,917 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1474902.0, ans=0.125 2023-06-26 02:47:24,033 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1474962.0, ans=0.0 2023-06-26 02:48:17,180 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1475082.0, ans=0.1 2023-06-26 02:48:30,765 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1475082.0, ans=0.125 2023-06-26 02:48:35,242 INFO [train.py:996] (0/4) Epoch 9, batch 1900, loss[loss=0.2054, simple_loss=0.2726, pruned_loss=0.06912, over 21134.00 frames. 
], tot_loss[loss=0.213, simple_loss=0.2954, pruned_loss=0.0653, over 4270516.87 frames. ], batch size: 143, lr: 3.37e-03, grad_scale: 8.0 2023-06-26 02:48:59,076 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1475202.0, ans=0.125 2023-06-26 02:50:22,001 INFO [train.py:996] (0/4) Epoch 9, batch 1950, loss[loss=0.1913, simple_loss=0.2625, pruned_loss=0.06, over 21847.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2915, pruned_loss=0.06536, over 4275711.84 frames. ], batch size: 107, lr: 3.37e-03, grad_scale: 8.0 2023-06-26 02:50:31,726 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1475442.0, ans=0.0 2023-06-26 02:50:39,720 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.119e+02 4.600e+02 6.101e+02 9.329e+02 1.931e+03, threshold=1.220e+03, percent-clipped=7.0 2023-06-26 02:50:40,608 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1475502.0, ans=0.2 2023-06-26 02:50:59,979 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1475562.0, ans=0.5 2023-06-26 02:51:03,283 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 02:51:34,236 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1475562.0, ans=0.125 2023-06-26 02:52:13,511 INFO [train.py:996] (0/4) Epoch 9, batch 2000, loss[loss=0.2502, simple_loss=0.3441, pruned_loss=0.07808, over 21823.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2878, pruned_loss=0.06391, over 4276437.14 frames. ], batch size: 372, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 02:52:35,964 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1475802.0, ans=0.125 2023-06-26 02:52:58,869 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1475862.0, ans=0.0 2023-06-26 02:53:58,863 INFO [train.py:996] (0/4) Epoch 9, batch 2050, loss[loss=0.1981, simple_loss=0.2848, pruned_loss=0.05565, over 21362.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.2885, pruned_loss=0.06524, over 4278408.09 frames. ], batch size: 131, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 02:54:12,438 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.39 vs. limit=15.0 2023-06-26 02:54:16,488 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.128e+02 5.298e+02 7.792e+02 1.006e+03 2.094e+03, threshold=1.558e+03, percent-clipped=16.0 2023-06-26 02:55:53,045 INFO [train.py:996] (0/4) Epoch 9, batch 2100, loss[loss=0.2746, simple_loss=0.3358, pruned_loss=0.1067, over 21396.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2921, pruned_loss=0.06763, over 4279776.64 frames. ], batch size: 471, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 02:55:53,581 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1476342.0, ans=0.1 2023-06-26 02:56:08,667 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.12 vs. 
limit=15.0 2023-06-26 02:57:09,533 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1476522.0, ans=0.1 2023-06-26 02:57:13,398 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1476522.0, ans=0.125 2023-06-26 02:57:44,910 INFO [train.py:996] (0/4) Epoch 9, batch 2150, loss[loss=0.2116, simple_loss=0.2806, pruned_loss=0.07124, over 21461.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2923, pruned_loss=0.0688, over 4279925.45 frames. ], batch size: 389, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 02:57:45,864 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.93 vs. limit=15.0 2023-06-26 02:58:02,892 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.265e+02 5.087e+02 7.506e+02 1.094e+03 2.833e+03, threshold=1.501e+03, percent-clipped=11.0 2023-06-26 02:58:29,033 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1476702.0, ans=0.1 2023-06-26 02:58:55,495 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.37 vs. limit=15.0 2023-06-26 02:59:14,136 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff3.min_abs, batch_count=1476882.0, ans=0.2 2023-06-26 02:59:29,091 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1476882.0, ans=0.0 2023-06-26 02:59:31,671 INFO [train.py:996] (0/4) Epoch 9, batch 2200, loss[loss=0.179, simple_loss=0.2551, pruned_loss=0.0515, over 21806.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2936, pruned_loss=0.06909, over 4276691.19 frames. ], batch size: 118, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:01:15,203 INFO [train.py:996] (0/4) Epoch 9, batch 2250, loss[loss=0.1901, simple_loss=0.2626, pruned_loss=0.05882, over 21722.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2917, pruned_loss=0.06805, over 4286499.56 frames. ], batch size: 371, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:01:32,990 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.139e+02 4.755e+02 7.951e+02 1.208e+03 2.238e+03, threshold=1.590e+03, percent-clipped=7.0 2023-06-26 03:01:44,086 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1477302.0, ans=0.0 2023-06-26 03:02:01,621 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1477362.0, ans=0.125 2023-06-26 03:02:19,674 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.92 vs. limit=15.0 2023-06-26 03:02:38,506 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1477422.0, ans=0.0 2023-06-26 03:03:05,221 INFO [train.py:996] (0/4) Epoch 9, batch 2300, loss[loss=0.2364, simple_loss=0.275, pruned_loss=0.0989, over 21469.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.287, pruned_loss=0.06806, over 4287816.22 frames. 
], batch size: 511, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:03:20,108 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1477542.0, ans=0.0 2023-06-26 03:03:50,638 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1477662.0, ans=0.125 2023-06-26 03:04:02,017 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1477662.0, ans=0.125 2023-06-26 03:04:26,488 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1477722.0, ans=0.0 2023-06-26 03:04:36,011 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.70 vs. limit=15.0 2023-06-26 03:04:51,418 INFO [train.py:996] (0/4) Epoch 9, batch 2350, loss[loss=0.2281, simple_loss=0.2902, pruned_loss=0.08298, over 21234.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2833, pruned_loss=0.06757, over 4277540.63 frames. ], batch size: 159, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:04:55,825 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1477842.0, ans=0.2 2023-06-26 03:05:10,838 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.83 vs. limit=15.0 2023-06-26 03:05:12,177 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1477842.0, ans=0.125 2023-06-26 03:05:15,083 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.161e+02 4.711e+02 6.334e+02 1.025e+03 2.139e+03, threshold=1.267e+03, percent-clipped=9.0 2023-06-26 03:05:55,419 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1477962.0, ans=0.125 2023-06-26 03:05:56,163 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.53 vs. limit=15.0 2023-06-26 03:06:15,879 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.13 vs. limit=15.0 2023-06-26 03:06:20,645 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1478022.0, ans=0.125 2023-06-26 03:06:44,891 INFO [train.py:996] (0/4) Epoch 9, batch 2400, loss[loss=0.2218, simple_loss=0.3014, pruned_loss=0.07115, over 21500.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2874, pruned_loss=0.07026, over 4281195.41 frames. 
], batch size: 112, lr: 3.37e-03, grad_scale: 32.0 2023-06-26 03:06:47,657 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1478142.0, ans=0.0 2023-06-26 03:06:47,665 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1478142.0, ans=0.2 2023-06-26 03:07:43,601 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1478262.0, ans=0.125 2023-06-26 03:07:45,403 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1478262.0, ans=0.07 2023-06-26 03:08:36,725 INFO [train.py:996] (0/4) Epoch 9, batch 2450, loss[loss=0.2395, simple_loss=0.3206, pruned_loss=0.07918, over 21809.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2902, pruned_loss=0.07157, over 4281551.75 frames. ], batch size: 441, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:08:53,832 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1478442.0, ans=0.1 2023-06-26 03:08:53,844 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1478442.0, ans=0.125 2023-06-26 03:09:01,666 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.534e+02 5.033e+02 6.854e+02 1.116e+03 2.187e+03, threshold=1.371e+03, percent-clipped=18.0 2023-06-26 03:09:27,929 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1478562.0, ans=0.2 2023-06-26 03:09:40,170 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1478562.0, ans=0.2 2023-06-26 03:10:05,125 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1478682.0, ans=0.125 2023-06-26 03:10:13,046 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1478682.0, ans=0.1 2023-06-26 03:10:16,803 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1478682.0, ans=0.125 2023-06-26 03:10:21,208 INFO [train.py:996] (0/4) Epoch 9, batch 2500, loss[loss=0.2197, simple_loss=0.311, pruned_loss=0.06421, over 21693.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2899, pruned_loss=0.07076, over 4282835.14 frames. ], batch size: 332, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:11:18,463 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1478862.0, ans=0.1 2023-06-26 03:11:33,838 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1478922.0, ans=0.125 2023-06-26 03:11:47,877 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1478982.0, ans=0.125 2023-06-26 03:11:51,429 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1478982.0, ans=0.1 2023-06-26 03:12:06,837 INFO [train.py:996] (0/4) Epoch 9, batch 2550, loss[loss=0.2304, simple_loss=0.3043, pruned_loss=0.07827, over 21779.00 frames. 
], tot_loss[loss=0.2145, simple_loss=0.2896, pruned_loss=0.06971, over 4268856.78 frames. ], batch size: 124, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:12:37,799 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.177e+02 4.403e+02 6.951e+02 9.882e+02 2.721e+03, threshold=1.390e+03, percent-clipped=12.0 2023-06-26 03:13:09,996 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=1479162.0, ans=6.0 2023-06-26 03:13:40,716 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1479282.0, ans=0.2 2023-06-26 03:13:57,245 INFO [train.py:996] (0/4) Epoch 9, batch 2600, loss[loss=0.2381, simple_loss=0.3175, pruned_loss=0.07942, over 21448.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2921, pruned_loss=0.07151, over 4270713.72 frames. ], batch size: 131, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:14:34,056 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1479402.0, ans=0.0 2023-06-26 03:14:34,190 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1479402.0, ans=0.1 2023-06-26 03:15:36,972 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1479582.0, ans=0.0 2023-06-26 03:15:43,480 INFO [train.py:996] (0/4) Epoch 9, batch 2650, loss[loss=0.2439, simple_loss=0.3202, pruned_loss=0.0838, over 21834.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2931, pruned_loss=0.0719, over 4275064.80 frames. ], batch size: 118, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:16:14,201 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.480e+02 5.388e+02 7.988e+02 1.143e+03 2.285e+03, threshold=1.598e+03, percent-clipped=12.0 2023-06-26 03:16:25,739 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1479702.0, ans=0.125 2023-06-26 03:16:42,440 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1479762.0, ans=0.2 2023-06-26 03:16:45,946 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1479762.0, ans=0.0 2023-06-26 03:16:58,969 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.02 vs. limit=6.0 2023-06-26 03:17:08,409 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1479822.0, ans=0.0 2023-06-26 03:17:28,000 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1479942.0, ans=0.0 2023-06-26 03:17:29,024 INFO [train.py:996] (0/4) Epoch 9, batch 2700, loss[loss=0.2431, simple_loss=0.3215, pruned_loss=0.08235, over 21612.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2926, pruned_loss=0.0712, over 4282156.91 frames. 
], batch size: 263, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:17:29,357 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1479942.0, ans=0.125 2023-06-26 03:17:55,043 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1480002.0, ans=0.125 2023-06-26 03:18:25,762 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1480062.0, ans=0.125 2023-06-26 03:18:46,287 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1480122.0, ans=0.125 2023-06-26 03:18:53,309 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1480122.0, ans=0.1 2023-06-26 03:19:20,005 INFO [train.py:996] (0/4) Epoch 9, batch 2750, loss[loss=0.2001, simple_loss=0.3191, pruned_loss=0.04054, over 19726.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2925, pruned_loss=0.07072, over 4283653.96 frames. ], batch size: 702, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:19:51,120 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.311e+02 4.494e+02 5.812e+02 9.696e+02 2.134e+03, threshold=1.162e+03, percent-clipped=3.0 2023-06-26 03:20:45,242 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1480422.0, ans=0.0 2023-06-26 03:21:02,133 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1480482.0, ans=0.05 2023-06-26 03:21:02,244 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1480482.0, ans=0.125 2023-06-26 03:21:19,573 INFO [train.py:996] (0/4) Epoch 9, batch 2800, loss[loss=0.1884, simple_loss=0.2611, pruned_loss=0.05782, over 21435.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.2991, pruned_loss=0.07254, over 4288285.45 frames. ], batch size: 212, lr: 3.37e-03, grad_scale: 32.0 2023-06-26 03:22:03,060 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.09 vs. limit=10.0 2023-06-26 03:22:36,264 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.84 vs. limit=22.5 2023-06-26 03:23:18,687 INFO [train.py:996] (0/4) Epoch 9, batch 2850, loss[loss=0.222, simple_loss=0.3267, pruned_loss=0.0587, over 20737.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.2997, pruned_loss=0.07336, over 4290128.26 frames. ], batch size: 607, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:23:41,753 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.99 vs. limit=10.0 2023-06-26 03:23:45,691 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.704e+02 5.417e+02 7.792e+02 1.299e+03 2.553e+03, threshold=1.558e+03, percent-clipped=28.0 2023-06-26 03:24:00,977 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.91 vs. 
limit=15.0 2023-06-26 03:24:16,272 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1481022.0, ans=0.1 2023-06-26 03:25:03,444 INFO [train.py:996] (0/4) Epoch 9, batch 2900, loss[loss=0.2386, simple_loss=0.314, pruned_loss=0.08159, over 21804.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2956, pruned_loss=0.07168, over 4282945.75 frames. ], batch size: 112, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:25:29,049 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1481142.0, ans=0.0 2023-06-26 03:26:20,709 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1481322.0, ans=10.0 2023-06-26 03:26:49,227 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1481382.0, ans=0.125 2023-06-26 03:26:53,589 INFO [train.py:996] (0/4) Epoch 9, batch 2950, loss[loss=0.2054, simple_loss=0.2883, pruned_loss=0.06129, over 21367.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.2969, pruned_loss=0.07235, over 4288747.97 frames. ], batch size: 131, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:27:16,743 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1481502.0, ans=0.125 2023-06-26 03:27:21,366 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.283e+02 4.507e+02 5.801e+02 9.754e+02 1.778e+03, threshold=1.160e+03, percent-clipped=2.0 2023-06-26 03:27:42,170 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.15 vs. limit=15.0 2023-06-26 03:28:08,128 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1481622.0, ans=0.0 2023-06-26 03:28:38,643 INFO [train.py:996] (0/4) Epoch 9, batch 3000, loss[loss=0.221, simple_loss=0.3009, pruned_loss=0.07053, over 21752.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.3004, pruned_loss=0.07327, over 4284175.68 frames. ], batch size: 332, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:28:38,645 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-26 03:29:01,199 INFO [train.py:1028] (0/4) Epoch 9, validation: loss=0.2514, simple_loss=0.3427, pruned_loss=0.08003, over 1796401.00 frames. 2023-06-26 03:29:01,200 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23714MB 2023-06-26 03:30:13,413 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1481922.0, ans=0.125 2023-06-26 03:30:48,251 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.35 vs. limit=22.5 2023-06-26 03:30:48,447 INFO [train.py:996] (0/4) Epoch 9, batch 3050, loss[loss=0.1745, simple_loss=0.2695, pruned_loss=0.03979, over 21740.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.301, pruned_loss=0.07177, over 4282160.47 frames. 
], batch size: 351, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:31:09,637 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.261e+02 4.684e+02 7.478e+02 1.068e+03 1.857e+03, threshold=1.496e+03, percent-clipped=20.0 2023-06-26 03:31:10,391 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1482102.0, ans=0.2 2023-06-26 03:31:20,490 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.50 vs. limit=22.5 2023-06-26 03:31:23,450 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1482162.0, ans=0.2 2023-06-26 03:31:25,433 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1482162.0, ans=0.2 2023-06-26 03:32:37,850 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1482282.0, ans=0.125 2023-06-26 03:32:42,245 INFO [train.py:996] (0/4) Epoch 9, batch 3100, loss[loss=0.1914, simple_loss=0.2771, pruned_loss=0.05285, over 21566.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2997, pruned_loss=0.06984, over 4291608.95 frames. ], batch size: 230, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:32:58,584 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.49 vs. limit=22.5 2023-06-26 03:33:00,426 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.56 vs. limit=15.0 2023-06-26 03:33:17,994 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1482462.0, ans=0.5 2023-06-26 03:34:36,180 INFO [train.py:996] (0/4) Epoch 9, batch 3150, loss[loss=0.2875, simple_loss=0.3538, pruned_loss=0.1106, over 21442.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2999, pruned_loss=0.06962, over 4289295.13 frames. ], batch size: 471, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 03:34:58,317 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.871e+02 4.385e+02 6.208e+02 9.255e+02 2.149e+03, threshold=1.242e+03, percent-clipped=3.0 2023-06-26 03:35:28,126 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1482762.0, ans=0.125 2023-06-26 03:35:44,502 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1482762.0, ans=0.0 2023-06-26 03:36:09,171 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1482882.0, ans=0.2 2023-06-26 03:36:25,734 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1482882.0, ans=0.0 2023-06-26 03:36:28,381 INFO [train.py:996] (0/4) Epoch 9, batch 3200, loss[loss=0.1888, simple_loss=0.2749, pruned_loss=0.05132, over 21785.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2978, pruned_loss=0.06908, over 4287113.34 frames. 
], batch size: 247, lr: 3.36e-03, grad_scale: 32.0 2023-06-26 03:36:38,183 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1482942.0, ans=0.0 2023-06-26 03:37:03,468 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1483002.0, ans=0.125 2023-06-26 03:37:22,880 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1483062.0, ans=0.07 2023-06-26 03:37:37,282 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1483122.0, ans=0.125 2023-06-26 03:37:54,780 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1483182.0, ans=0.1 2023-06-26 03:38:13,657 INFO [train.py:996] (0/4) Epoch 9, batch 3250, loss[loss=0.2055, simple_loss=0.2629, pruned_loss=0.07403, over 21394.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2995, pruned_loss=0.0711, over 4282554.35 frames. ], batch size: 194, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 03:38:21,090 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1483242.0, ans=0.125 2023-06-26 03:38:47,156 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.805e+02 4.877e+02 6.649e+02 1.271e+03 2.472e+03, threshold=1.330e+03, percent-clipped=27.0 2023-06-26 03:39:02,294 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1483302.0, ans=0.0 2023-06-26 03:39:07,882 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1483362.0, ans=0.0 2023-06-26 03:39:47,336 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.00 vs. limit=15.0 2023-06-26 03:40:05,665 INFO [train.py:996] (0/4) Epoch 9, batch 3300, loss[loss=0.2009, simple_loss=0.2926, pruned_loss=0.05456, over 21346.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2961, pruned_loss=0.0705, over 4275340.49 frames. ], batch size: 211, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 03:41:11,455 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1483662.0, ans=0.0 2023-06-26 03:41:22,704 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1483722.0, ans=0.5 2023-06-26 03:41:47,772 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1483782.0, ans=0.07 2023-06-26 03:41:53,175 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1483782.0, ans=0.0 2023-06-26 03:42:03,482 INFO [train.py:996] (0/4) Epoch 9, batch 3350, loss[loss=0.2451, simple_loss=0.3304, pruned_loss=0.07991, over 21597.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.2985, pruned_loss=0.07149, over 4280556.62 frames. ], batch size: 389, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 03:42:15,629 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.05 vs. 
limit=15.0 2023-06-26 03:42:18,539 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1483842.0, ans=0.125 2023-06-26 03:42:36,953 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.432e+02 5.027e+02 7.904e+02 1.051e+03 2.659e+03, threshold=1.581e+03, percent-clipped=15.0 2023-06-26 03:43:04,325 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1483962.0, ans=0.2 2023-06-26 03:43:58,355 INFO [train.py:996] (0/4) Epoch 9, batch 3400, loss[loss=0.2312, simple_loss=0.3045, pruned_loss=0.07888, over 21568.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.2996, pruned_loss=0.07229, over 4288563.12 frames. ], batch size: 441, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 03:44:41,361 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1484202.0, ans=0.125 2023-06-26 03:44:41,407 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1484202.0, ans=0.0 2023-06-26 03:44:48,728 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1484262.0, ans=0.0 2023-06-26 03:44:59,414 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1484262.0, ans=0.2 2023-06-26 03:45:09,631 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.32 vs. limit=15.0 2023-06-26 03:45:51,262 INFO [train.py:996] (0/4) Epoch 9, batch 3450, loss[loss=0.194, simple_loss=0.2637, pruned_loss=0.06221, over 21871.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2937, pruned_loss=0.07073, over 4283110.28 frames. ], batch size: 98, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 03:46:06,007 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1484442.0, ans=0.125 2023-06-26 03:46:12,035 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.35 vs. limit=15.0 2023-06-26 03:46:19,085 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.13 vs. limit=15.0 2023-06-26 03:46:19,614 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.178e+02 5.080e+02 7.210e+02 9.972e+02 1.993e+03, threshold=1.442e+03, percent-clipped=4.0 2023-06-26 03:46:28,195 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.49 vs. limit=15.0 2023-06-26 03:46:33,284 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1484502.0, ans=10.0 2023-06-26 03:46:35,600 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.96 vs. limit=10.0 2023-06-26 03:47:00,596 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1484622.0, ans=0.0 2023-06-26 03:47:47,828 INFO [train.py:996] (0/4) Epoch 9, batch 3500, loss[loss=0.2604, simple_loss=0.3441, pruned_loss=0.08829, over 21737.00 frames. 
], tot_loss[loss=0.2238, simple_loss=0.3003, pruned_loss=0.07368, over 4273738.46 frames. ], batch size: 351, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 03:47:50,024 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 03:47:55,725 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1484742.0, ans=0.0 2023-06-26 03:48:38,928 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1484862.0, ans=0.5 2023-06-26 03:49:37,619 INFO [train.py:996] (0/4) Epoch 9, batch 3550, loss[loss=0.2014, simple_loss=0.2944, pruned_loss=0.05419, over 20959.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3044, pruned_loss=0.07518, over 4276197.65 frames. ], batch size: 607, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 03:50:05,950 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.388e+02 4.836e+02 6.336e+02 9.493e+02 2.947e+03, threshold=1.267e+03, percent-clipped=8.0 2023-06-26 03:50:20,680 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1485162.0, ans=0.0 2023-06-26 03:50:23,858 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1485162.0, ans=0.0 2023-06-26 03:50:27,950 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.98 vs. limit=22.5 2023-06-26 03:51:27,705 INFO [train.py:996] (0/4) Epoch 9, batch 3600, loss[loss=0.2546, simple_loss=0.3185, pruned_loss=0.09535, over 21581.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.299, pruned_loss=0.0749, over 4268653.14 frames. ], batch size: 415, lr: 3.36e-03, grad_scale: 32.0 2023-06-26 03:51:39,884 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1485342.0, ans=0.125 2023-06-26 03:51:46,668 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1485342.0, ans=0.0 2023-06-26 03:52:00,821 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1485402.0, ans=0.1 2023-06-26 03:52:08,153 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1485462.0, ans=0.125 2023-06-26 03:52:17,113 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1485462.0, ans=0.125 2023-06-26 03:52:52,486 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1485522.0, ans=0.05 2023-06-26 03:52:54,065 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1485522.0, ans=0.125 2023-06-26 03:53:05,248 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1485582.0, ans=0.0 2023-06-26 03:53:28,621 INFO [train.py:996] (0/4) Epoch 9, batch 3650, loss[loss=0.2329, simple_loss=0.3291, pruned_loss=0.06829, over 21700.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.3015, pruned_loss=0.07614, over 4270965.11 frames. 
], batch size: 441, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 03:53:29,144 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1485642.0, ans=0.0 2023-06-26 03:53:31,119 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1485642.0, ans=0.04949747468305833 2023-06-26 03:53:47,158 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1485702.0, ans=0.2 2023-06-26 03:53:53,240 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.365e+02 4.857e+02 6.488e+02 1.037e+03 3.171e+03, threshold=1.298e+03, percent-clipped=18.0 2023-06-26 03:53:57,825 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1485702.0, ans=0.0 2023-06-26 03:54:44,315 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1485822.0, ans=0.2 2023-06-26 03:55:19,270 INFO [train.py:996] (0/4) Epoch 9, batch 3700, loss[loss=0.237, simple_loss=0.3046, pruned_loss=0.08473, over 21330.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3021, pruned_loss=0.07527, over 4271950.07 frames. ], batch size: 548, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 03:55:26,528 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1485942.0, ans=0.2 2023-06-26 03:56:38,707 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1486122.0, ans=0.125 2023-06-26 03:56:46,760 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.04 vs. limit=15.0 2023-06-26 03:57:10,165 INFO [train.py:996] (0/4) Epoch 9, batch 3750, loss[loss=0.2489, simple_loss=0.3182, pruned_loss=0.08974, over 21604.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.2993, pruned_loss=0.07469, over 4277806.14 frames. ], batch size: 508, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 03:57:35,334 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.387e+02 4.638e+02 6.369e+02 1.007e+03 1.951e+03, threshold=1.274e+03, percent-clipped=10.0 2023-06-26 03:57:55,158 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1486362.0, ans=0.1 2023-06-26 03:58:02,800 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1486362.0, ans=0.0 2023-06-26 03:58:16,672 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1486362.0, ans=0.04949747468305833 2023-06-26 03:59:00,822 INFO [train.py:996] (0/4) Epoch 9, batch 3800, loss[loss=0.2296, simple_loss=0.3093, pruned_loss=0.07498, over 21710.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2978, pruned_loss=0.0729, over 4270507.28 frames. 
], batch size: 332, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 03:59:22,702 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1486602.0, ans=0.0 2023-06-26 03:59:33,488 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1486602.0, ans=0.2 2023-06-26 04:00:03,130 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1486662.0, ans=0.125 2023-06-26 04:00:20,752 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1486722.0, ans=0.0 2023-06-26 04:00:41,529 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1486782.0, ans=10.0 2023-06-26 04:00:49,628 INFO [train.py:996] (0/4) Epoch 9, batch 3850, loss[loss=0.1897, simple_loss=0.2532, pruned_loss=0.06311, over 21991.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2948, pruned_loss=0.07284, over 4270052.09 frames. ], batch size: 103, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 04:00:55,024 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1486842.0, ans=0.2 2023-06-26 04:01:19,294 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.209e+02 4.310e+02 5.472e+02 7.871e+02 1.774e+03, threshold=1.094e+03, percent-clipped=3.0 2023-06-26 04:02:18,534 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.96 vs. limit=22.5 2023-06-26 04:02:30,722 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1487082.0, ans=0.125 2023-06-26 04:02:39,305 INFO [train.py:996] (0/4) Epoch 9, batch 3900, loss[loss=0.2263, simple_loss=0.3129, pruned_loss=0.06988, over 21335.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2906, pruned_loss=0.07224, over 4269527.27 frames. ], batch size: 548, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 04:04:07,660 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1487322.0, ans=0.125 2023-06-26 04:04:13,277 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.07 vs. limit=10.0 2023-06-26 04:04:15,160 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.88 vs. limit=15.0 2023-06-26 04:04:27,450 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.23 vs. limit=22.5 2023-06-26 04:04:29,553 INFO [train.py:996] (0/4) Epoch 9, batch 3950, loss[loss=0.2393, simple_loss=0.3067, pruned_loss=0.08593, over 19915.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.291, pruned_loss=0.07116, over 4270622.74 frames. 
], batch size: 703, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 04:04:56,958 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1487502.0, ans=0.1 2023-06-26 04:04:59,611 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.430e+02 5.269e+02 7.379e+02 1.187e+03 2.051e+03, threshold=1.476e+03, percent-clipped=29.0 2023-06-26 04:05:23,564 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.26 vs. limit=12.0 2023-06-26 04:06:21,598 INFO [train.py:996] (0/4) Epoch 9, batch 4000, loss[loss=0.221, simple_loss=0.3199, pruned_loss=0.06105, over 20764.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.286, pruned_loss=0.06813, over 4270436.71 frames. ], batch size: 608, lr: 3.36e-03, grad_scale: 32.0 2023-06-26 04:06:29,322 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1487742.0, ans=0.125 2023-06-26 04:07:08,573 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1487862.0, ans=0.0 2023-06-26 04:07:22,706 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1487862.0, ans=0.125 2023-06-26 04:07:26,617 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1487862.0, ans=0.1 2023-06-26 04:07:37,909 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.46 vs. limit=22.5 2023-06-26 04:07:52,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1487922.0, ans=0.0 2023-06-26 04:07:59,629 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-248000.pt 2023-06-26 04:08:05,502 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.38 vs. limit=15.0 2023-06-26 04:08:10,319 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1487982.0, ans=0.125 2023-06-26 04:08:10,326 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1487982.0, ans=0.0 2023-06-26 04:08:15,055 INFO [train.py:996] (0/4) Epoch 9, batch 4050, loss[loss=0.2348, simple_loss=0.3323, pruned_loss=0.06864, over 21611.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.2858, pruned_loss=0.06699, over 4277239.21 frames. ], batch size: 414, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 04:08:41,827 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.54 vs. 
limit=22.5 2023-06-26 04:08:54,031 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.060e+02 4.408e+02 5.792e+02 1.027e+03 1.957e+03, threshold=1.158e+03, percent-clipped=6.0 2023-06-26 04:08:56,224 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1488102.0, ans=0.125 2023-06-26 04:09:19,094 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1488162.0, ans=0.125 2023-06-26 04:09:19,602 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.82 vs. limit=15.0 2023-06-26 04:09:35,559 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1488222.0, ans=0.0 2023-06-26 04:09:44,331 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1488222.0, ans=0.125 2023-06-26 04:10:06,505 INFO [train.py:996] (0/4) Epoch 9, batch 4100, loss[loss=0.2178, simple_loss=0.2999, pruned_loss=0.0679, over 19922.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2874, pruned_loss=0.06762, over 4278853.38 frames. ], batch size: 702, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 04:10:43,191 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1488402.0, ans=0.1 2023-06-26 04:10:44,690 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1488402.0, ans=0.125 2023-06-26 04:11:23,932 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1488522.0, ans=0.125 2023-06-26 04:11:35,704 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.44 vs. limit=10.0 2023-06-26 04:11:45,625 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1488582.0, ans=0.0 2023-06-26 04:11:57,775 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1488642.0, ans=0.0 2023-06-26 04:11:58,928 INFO [train.py:996] (0/4) Epoch 9, batch 4150, loss[loss=0.1877, simple_loss=0.2842, pruned_loss=0.04559, over 21656.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2882, pruned_loss=0.06533, over 4276049.76 frames. ], batch size: 414, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 04:12:42,729 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.917e+02 4.750e+02 6.636e+02 9.716e+02 1.939e+03, threshold=1.327e+03, percent-clipped=13.0 2023-06-26 04:13:04,155 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.46 vs. 
limit=10.0 2023-06-26 04:13:19,930 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1488822.0, ans=0.04949747468305833 2023-06-26 04:13:21,590 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1488822.0, ans=0.1 2023-06-26 04:13:32,443 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1488882.0, ans=0.125 2023-06-26 04:13:52,288 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1488882.0, ans=0.125 2023-06-26 04:13:52,342 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1488882.0, ans=0.2 2023-06-26 04:13:57,468 INFO [train.py:996] (0/4) Epoch 9, batch 4200, loss[loss=0.2283, simple_loss=0.3238, pruned_loss=0.06635, over 21700.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2885, pruned_loss=0.06579, over 4280265.53 frames. ], batch size: 332, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 04:14:06,566 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1488942.0, ans=0.125 2023-06-26 04:14:47,590 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1489002.0, ans=0.07 2023-06-26 04:14:52,818 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1489062.0, ans=0.025 2023-06-26 04:15:56,540 INFO [train.py:996] (0/4) Epoch 9, batch 4250, loss[loss=0.2502, simple_loss=0.3264, pruned_loss=0.08698, over 21808.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2958, pruned_loss=0.06807, over 4282744.68 frames. ], batch size: 298, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 04:15:57,832 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.72 vs. limit=15.0 2023-06-26 04:16:25,192 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.42 vs. limit=15.0 2023-06-26 04:16:34,524 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.173e+02 6.968e+02 9.905e+02 1.425e+03 3.258e+03, threshold=1.981e+03, percent-clipped=30.0 2023-06-26 04:16:39,331 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1489302.0, ans=0.0 2023-06-26 04:17:55,658 INFO [train.py:996] (0/4) Epoch 9, batch 4300, loss[loss=0.2614, simple_loss=0.3757, pruned_loss=0.07352, over 21264.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.3004, pruned_loss=0.06908, over 4277263.67 frames. ], batch size: 549, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 04:18:09,878 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.38 vs. 
limit=15.0 2023-06-26 04:18:11,569 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1489542.0, ans=0.0 2023-06-26 04:18:13,185 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1489542.0, ans=0.0 2023-06-26 04:19:01,544 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1489722.0, ans=0.0 2023-06-26 04:19:19,625 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1489722.0, ans=0.125 2023-06-26 04:19:52,238 INFO [train.py:996] (0/4) Epoch 9, batch 4350, loss[loss=0.2091, simple_loss=0.2831, pruned_loss=0.06759, over 21790.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.3002, pruned_loss=0.06883, over 4271137.21 frames. ], batch size: 107, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 04:20:11,478 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.02 vs. limit=8.0 2023-06-26 04:20:18,897 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.396e+02 4.613e+02 6.929e+02 1.161e+03 2.829e+03, threshold=1.386e+03, percent-clipped=7.0 2023-06-26 04:20:56,605 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1490022.0, ans=0.0 2023-06-26 04:21:42,281 INFO [train.py:996] (0/4) Epoch 9, batch 4400, loss[loss=0.2949, simple_loss=0.3633, pruned_loss=0.1132, over 21479.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.297, pruned_loss=0.06843, over 4265464.45 frames. ], batch size: 508, lr: 3.36e-03, grad_scale: 32.0 2023-06-26 04:23:10,428 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.62 vs. limit=6.0 2023-06-26 04:23:30,941 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1490382.0, ans=10.0 2023-06-26 04:23:35,726 INFO [train.py:996] (0/4) Epoch 9, batch 4450, loss[loss=0.2743, simple_loss=0.3721, pruned_loss=0.08822, over 21698.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.3061, pruned_loss=0.07078, over 4267608.62 frames. 
], batch size: 414, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 04:23:50,412 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1490442.0, ans=0.125 2023-06-26 04:24:03,477 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.467e+02 5.132e+02 7.510e+02 1.153e+03 2.650e+03, threshold=1.502e+03, percent-clipped=12.0 2023-06-26 04:24:28,489 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1490562.0, ans=0.1 2023-06-26 04:24:37,212 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1490562.0, ans=0.2 2023-06-26 04:24:44,095 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1490622.0, ans=0.125 2023-06-26 04:24:57,994 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=1490622.0, ans=10.0 2023-06-26 04:25:25,706 INFO [train.py:996] (0/4) Epoch 9, batch 4500, loss[loss=0.2043, simple_loss=0.2859, pruned_loss=0.06133, over 21480.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3074, pruned_loss=0.07224, over 4269029.34 frames. ], batch size: 194, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 04:25:26,286 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1490742.0, ans=0.1 2023-06-26 04:25:28,122 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1490742.0, ans=0.0 2023-06-26 04:25:37,857 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.26 vs. limit=5.0 2023-06-26 04:25:38,793 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1490742.0, ans=0.125 2023-06-26 04:26:05,366 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.80 vs. limit=15.0 2023-06-26 04:26:33,149 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1490922.0, ans=0.125 2023-06-26 04:26:47,939 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.90 vs. limit=15.0 2023-06-26 04:26:50,696 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1490922.0, ans=0.125 2023-06-26 04:27:08,480 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.11 vs. limit=12.0 2023-06-26 04:27:15,816 INFO [train.py:996] (0/4) Epoch 9, batch 4550, loss[loss=0.3025, simple_loss=0.3645, pruned_loss=0.1203, over 21361.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.3092, pruned_loss=0.07258, over 4274276.98 frames. 
], batch size: 507, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 04:27:23,722 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1491042.0, ans=0.0 2023-06-26 04:27:56,165 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1491102.0, ans=0.125 2023-06-26 04:28:00,502 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.475e+02 4.870e+02 6.557e+02 1.171e+03 3.635e+03, threshold=1.311e+03, percent-clipped=15.0 2023-06-26 04:28:06,796 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1491162.0, ans=0.125 2023-06-26 04:28:35,584 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.37 vs. limit=10.0 2023-06-26 04:28:43,665 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1491222.0, ans=0.125 2023-06-26 04:29:05,612 INFO [train.py:996] (0/4) Epoch 9, batch 4600, loss[loss=0.1892, simple_loss=0.2715, pruned_loss=0.05349, over 21744.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3099, pruned_loss=0.0741, over 4270970.90 frames. ], batch size: 247, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 04:29:46,642 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1491402.0, ans=0.07 2023-06-26 04:30:11,708 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1491462.0, ans=0.125 2023-06-26 04:31:00,435 INFO [train.py:996] (0/4) Epoch 9, batch 4650, loss[loss=0.1718, simple_loss=0.2425, pruned_loss=0.05054, over 21262.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.3033, pruned_loss=0.07263, over 4277269.88 frames. ], batch size: 143, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 04:31:18,852 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1491642.0, ans=0.5 2023-06-26 04:31:32,487 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1491702.0, ans=0.0 2023-06-26 04:31:38,993 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.118e+02 4.318e+02 5.535e+02 7.322e+02 1.899e+03, threshold=1.107e+03, percent-clipped=2.0 2023-06-26 04:31:41,417 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1491702.0, ans=0.2 2023-06-26 04:32:55,325 INFO [train.py:996] (0/4) Epoch 9, batch 4700, loss[loss=0.2027, simple_loss=0.2673, pruned_loss=0.06908, over 21708.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2932, pruned_loss=0.07018, over 4277076.63 frames. ], batch size: 333, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 04:33:04,568 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 04:33:48,664 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1492062.0, ans=0.125 2023-06-26 04:34:08,746 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1492122.0, ans=0.0 2023-06-26 04:34:38,034 INFO [train.py:996] (0/4) Epoch 9, batch 4750, loss[loss=0.182, simple_loss=0.2541, pruned_loss=0.05499, over 21651.00 frames. 
], tot_loss[loss=0.2137, simple_loss=0.2874, pruned_loss=0.07007, over 4275313.89 frames. ], batch size: 282, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 04:34:55,763 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1492242.0, ans=0.125 2023-06-26 04:35:16,913 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.998e+02 4.450e+02 6.729e+02 1.004e+03 1.717e+03, threshold=1.346e+03, percent-clipped=12.0 2023-06-26 04:35:17,729 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1492302.0, ans=0.035 2023-06-26 04:35:19,450 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1492302.0, ans=0.035 2023-06-26 04:35:19,566 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1492302.0, ans=0.125 2023-06-26 04:35:57,126 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1492422.0, ans=0.125 2023-06-26 04:36:21,692 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1492482.0, ans=0.04949747468305833 2023-06-26 04:36:32,954 INFO [train.py:996] (0/4) Epoch 9, batch 4800, loss[loss=0.1957, simple_loss=0.2838, pruned_loss=0.0538, over 21454.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2872, pruned_loss=0.07025, over 4283017.05 frames. ], batch size: 194, lr: 3.35e-03, grad_scale: 32.0 2023-06-26 04:36:37,789 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.03 vs. limit=6.0 2023-06-26 04:36:39,203 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1492542.0, ans=0.125 2023-06-26 04:36:49,856 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.41 vs. limit=15.0 2023-06-26 04:37:06,131 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1492602.0, ans=0.0 2023-06-26 04:37:36,179 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1492722.0, ans=0.1 2023-06-26 04:38:21,220 INFO [train.py:996] (0/4) Epoch 9, batch 4850, loss[loss=0.2362, simple_loss=0.3068, pruned_loss=0.08283, over 21739.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2876, pruned_loss=0.0702, over 4276572.83 frames. ], batch size: 441, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 04:38:22,394 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.64 vs. 
limit=22.5 2023-06-26 04:38:32,537 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 04:38:43,141 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1492902.0, ans=0.1 2023-06-26 04:38:56,365 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.280e+02 4.162e+02 5.020e+02 8.423e+02 2.243e+03, threshold=1.004e+03, percent-clipped=7.0 2023-06-26 04:40:11,775 INFO [train.py:996] (0/4) Epoch 9, batch 4900, loss[loss=0.2301, simple_loss=0.3174, pruned_loss=0.07147, over 21587.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2879, pruned_loss=0.07065, over 4278650.99 frames. ], batch size: 230, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 04:40:19,304 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1493142.0, ans=0.125 2023-06-26 04:42:01,683 INFO [train.py:996] (0/4) Epoch 9, batch 4950, loss[loss=0.2177, simple_loss=0.3123, pruned_loss=0.06157, over 21560.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2903, pruned_loss=0.06849, over 4273122.68 frames. ], batch size: 441, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 04:42:42,357 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.982e+02 5.004e+02 7.690e+02 1.209e+03 2.410e+03, threshold=1.538e+03, percent-clipped=31.0 2023-06-26 04:42:44,881 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1493502.0, ans=0.0 2023-06-26 04:42:46,766 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1493562.0, ans=0.2 2023-06-26 04:42:58,899 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.51 vs. limit=15.0 2023-06-26 04:43:00,693 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1493562.0, ans=0.1 2023-06-26 04:43:03,940 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1493622.0, ans=0.07 2023-06-26 04:43:22,456 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1493682.0, ans=0.0 2023-06-26 04:43:49,279 INFO [train.py:996] (0/4) Epoch 9, batch 5000, loss[loss=0.2124, simple_loss=0.2957, pruned_loss=0.06457, over 21850.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2908, pruned_loss=0.06545, over 4271772.47 frames. ], batch size: 282, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 04:45:37,516 INFO [train.py:996] (0/4) Epoch 9, batch 5050, loss[loss=0.1958, simple_loss=0.2692, pruned_loss=0.06123, over 21349.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2915, pruned_loss=0.06763, over 4285392.81 frames. ], batch size: 176, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 04:46:12,687 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.467e+02 4.718e+02 6.361e+02 8.600e+02 1.640e+03, threshold=1.272e+03, percent-clipped=2.0 2023-06-26 04:46:30,251 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.38 vs. 
limit=15.0 2023-06-26 04:46:40,117 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1494222.0, ans=0.0 2023-06-26 04:47:26,060 INFO [train.py:996] (0/4) Epoch 9, batch 5100, loss[loss=0.2315, simple_loss=0.3001, pruned_loss=0.08148, over 21593.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2912, pruned_loss=0.06828, over 4283818.63 frames. ], batch size: 471, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 04:47:56,467 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1494402.0, ans=0.0 2023-06-26 04:48:11,953 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.83 vs. limit=15.0 2023-06-26 04:48:30,273 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1494522.0, ans=0.2 2023-06-26 04:48:59,851 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1494582.0, ans=0.0 2023-06-26 04:49:09,891 INFO [train.py:996] (0/4) Epoch 9, batch 5150, loss[loss=0.2119, simple_loss=0.2769, pruned_loss=0.07345, over 21457.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2898, pruned_loss=0.06886, over 4292341.15 frames. ], batch size: 194, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 04:49:35,024 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1494642.0, ans=0.0 2023-06-26 04:49:50,346 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.908e+02 4.572e+02 6.344e+02 1.136e+03 2.635e+03, threshold=1.269e+03, percent-clipped=18.0 2023-06-26 04:49:51,198 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1494702.0, ans=0.125 2023-06-26 04:50:19,276 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1494822.0, ans=0.125 2023-06-26 04:50:40,596 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1494822.0, ans=0.09899494936611666 2023-06-26 04:51:10,734 INFO [train.py:996] (0/4) Epoch 9, batch 5200, loss[loss=0.2023, simple_loss=0.2852, pruned_loss=0.05968, over 21204.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2925, pruned_loss=0.06954, over 4291218.55 frames. ], batch size: 144, lr: 3.35e-03, grad_scale: 32.0 2023-06-26 04:51:58,924 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1495062.0, ans=0.125 2023-06-26 04:52:55,645 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1495182.0, ans=0.0 2023-06-26 04:52:58,605 INFO [train.py:996] (0/4) Epoch 9, batch 5250, loss[loss=0.1993, simple_loss=0.2835, pruned_loss=0.05756, over 21581.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2977, pruned_loss=0.06774, over 4275506.27 frames. 
], batch size: 230, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 04:52:59,180 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1495242.0, ans=0.2 2023-06-26 04:53:06,483 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1495242.0, ans=0.125 2023-06-26 04:53:35,846 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.142e+02 4.723e+02 6.704e+02 8.682e+02 1.617e+03, threshold=1.341e+03, percent-clipped=7.0 2023-06-26 04:53:45,821 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1495362.0, ans=0.125 2023-06-26 04:53:45,876 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1495362.0, ans=0.125 2023-06-26 04:53:54,611 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1495362.0, ans=0.0 2023-06-26 04:54:17,409 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.77 vs. limit=22.5 2023-06-26 04:54:33,642 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1495482.0, ans=0.125 2023-06-26 04:54:46,063 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1495482.0, ans=0.0 2023-06-26 04:54:50,694 INFO [train.py:996] (0/4) Epoch 9, batch 5300, loss[loss=0.2026, simple_loss=0.2783, pruned_loss=0.06346, over 21687.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2982, pruned_loss=0.06854, over 4275161.36 frames. ], batch size: 230, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 04:54:53,090 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 04:55:09,603 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1495542.0, ans=0.125 2023-06-26 04:55:29,007 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.07 vs. limit=6.0 2023-06-26 04:55:31,771 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1495662.0, ans=0.125 2023-06-26 04:56:02,049 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1495722.0, ans=0.0 2023-06-26 04:56:39,222 INFO [train.py:996] (0/4) Epoch 9, batch 5350, loss[loss=0.2076, simple_loss=0.2861, pruned_loss=0.06451, over 21911.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2967, pruned_loss=0.0693, over 4279495.53 frames. 
], batch size: 351, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 04:56:39,875 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1495842.0, ans=0.125 2023-06-26 04:57:04,796 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1495902.0, ans=0.125 2023-06-26 04:57:11,290 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1495902.0, ans=0.0 2023-06-26 04:57:15,453 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.489e+02 4.386e+02 5.571e+02 7.652e+02 1.743e+03, threshold=1.114e+03, percent-clipped=3.0 2023-06-26 04:58:16,272 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1496082.0, ans=0.1 2023-06-26 04:58:17,791 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1496082.0, ans=0.125 2023-06-26 04:58:27,564 INFO [train.py:996] (0/4) Epoch 9, batch 5400, loss[loss=0.2208, simple_loss=0.3138, pruned_loss=0.06387, over 20981.00 frames. ], tot_loss[loss=0.218, simple_loss=0.296, pruned_loss=0.07, over 4280271.11 frames. ], batch size: 607, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 04:58:47,963 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1496142.0, ans=0.125 2023-06-26 04:58:51,444 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1496202.0, ans=0.1 2023-06-26 04:59:05,496 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1496202.0, ans=0.125 2023-06-26 04:59:33,572 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1496262.0, ans=0.1 2023-06-26 05:00:21,908 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1496442.0, ans=0.0 2023-06-26 05:00:22,762 INFO [train.py:996] (0/4) Epoch 9, batch 5450, loss[loss=0.2748, simple_loss=0.3604, pruned_loss=0.09456, over 21734.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2979, pruned_loss=0.06939, over 4281339.24 frames. ], batch size: 441, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 05:00:24,936 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1496442.0, ans=0.125 2023-06-26 05:00:54,978 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.911e+02 4.664e+02 7.291e+02 1.143e+03 2.963e+03, threshold=1.458e+03, percent-clipped=26.0 2023-06-26 05:01:06,096 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 05:01:19,815 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1496562.0, ans=0.0 2023-06-26 05:01:25,928 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.43 vs. 
limit=15.0 2023-06-26 05:02:05,124 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1496682.0, ans=0.0 2023-06-26 05:02:08,921 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1496682.0, ans=0.125 2023-06-26 05:02:12,201 INFO [train.py:996] (0/4) Epoch 9, batch 5500, loss[loss=0.2117, simple_loss=0.3148, pruned_loss=0.05434, over 21641.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.3035, pruned_loss=0.06667, over 4283614.78 frames. ], batch size: 389, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 05:02:38,682 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1496802.0, ans=0.125 2023-06-26 05:02:56,364 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1496862.0, ans=0.125 2023-06-26 05:03:08,648 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1496862.0, ans=0.0 2023-06-26 05:03:53,548 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1496982.0, ans=0.125 2023-06-26 05:04:02,054 INFO [train.py:996] (0/4) Epoch 9, batch 5550, loss[loss=0.184, simple_loss=0.263, pruned_loss=0.05251, over 21297.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2998, pruned_loss=0.06378, over 4276681.79 frames. ], batch size: 176, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 05:04:44,535 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.199e+02 5.816e+02 9.061e+02 1.223e+03 2.185e+03, threshold=1.812e+03, percent-clipped=16.0 2023-06-26 05:04:56,900 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.27 vs. limit=10.0 2023-06-26 05:05:01,118 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1497162.0, ans=0.015 2023-06-26 05:05:23,294 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1497222.0, ans=0.1 2023-06-26 05:05:45,698 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.52 vs. limit=15.0 2023-06-26 05:05:58,691 INFO [train.py:996] (0/4) Epoch 9, batch 5600, loss[loss=0.1911, simple_loss=0.2491, pruned_loss=0.06657, over 20750.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2962, pruned_loss=0.06189, over 4274490.41 frames. ], batch size: 609, lr: 3.35e-03, grad_scale: 32.0 2023-06-26 05:06:09,145 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.78 vs. limit=22.5 2023-06-26 05:06:25,870 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1497402.0, ans=0.125 2023-06-26 05:07:17,540 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.13 vs. 
limit=6.0 2023-06-26 05:07:18,533 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1497522.0, ans=10.0 2023-06-26 05:07:23,470 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1497582.0, ans=0.125 2023-06-26 05:07:33,963 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1497582.0, ans=0.0 2023-06-26 05:07:45,558 INFO [train.py:996] (0/4) Epoch 9, batch 5650, loss[loss=0.237, simple_loss=0.3082, pruned_loss=0.08285, over 21749.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.3007, pruned_loss=0.06536, over 4281123.71 frames. ], batch size: 112, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 05:08:07,040 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1497702.0, ans=0.125 2023-06-26 05:08:29,190 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.963e+02 5.175e+02 8.774e+02 1.262e+03 2.376e+03, threshold=1.755e+03, percent-clipped=8.0 2023-06-26 05:08:33,678 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1497762.0, ans=0.1 2023-06-26 05:09:41,567 INFO [train.py:996] (0/4) Epoch 9, batch 5700, loss[loss=0.2162, simple_loss=0.3116, pruned_loss=0.06041, over 21645.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.3004, pruned_loss=0.06614, over 4281255.75 frames. ], batch size: 389, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 05:09:44,635 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.66 vs. limit=12.0 2023-06-26 05:10:09,435 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.52 vs. limit=22.5 2023-06-26 05:10:39,340 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1498062.0, ans=0.1 2023-06-26 05:11:03,087 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1498122.0, ans=0.125 2023-06-26 05:11:16,906 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1498182.0, ans=0.0 2023-06-26 05:11:39,515 INFO [train.py:996] (0/4) Epoch 9, batch 5750, loss[loss=0.1784, simple_loss=0.2212, pruned_loss=0.06782, over 19997.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2953, pruned_loss=0.06357, over 4280439.88 frames. ], batch size: 703, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 05:12:18,979 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.313e+02 4.582e+02 6.982e+02 1.089e+03 2.466e+03, threshold=1.396e+03, percent-clipped=2.0 2023-06-26 05:13:04,010 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1498422.0, ans=0.2 2023-06-26 05:13:31,255 INFO [train.py:996] (0/4) Epoch 9, batch 5800, loss[loss=0.2183, simple_loss=0.3122, pruned_loss=0.06222, over 21685.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2952, pruned_loss=0.06248, over 4278748.96 frames. 
], batch size: 263, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 05:14:34,886 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1498662.0, ans=0.125 2023-06-26 05:15:27,946 INFO [train.py:996] (0/4) Epoch 9, batch 5850, loss[loss=0.1501, simple_loss=0.2484, pruned_loss=0.02594, over 21345.00 frames. ], tot_loss[loss=0.2048, simple_loss=0.2929, pruned_loss=0.05832, over 4281816.18 frames. ], batch size: 194, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 05:15:38,668 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1498842.0, ans=0.125 2023-06-26 05:16:01,573 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1498902.0, ans=0.0 2023-06-26 05:16:05,868 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.859e+02 4.519e+02 6.797e+02 9.504e+02 2.240e+03, threshold=1.359e+03, percent-clipped=6.0 2023-06-26 05:16:25,633 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1498962.0, ans=0.125 2023-06-26 05:16:47,865 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1499022.0, ans=0.125 2023-06-26 05:17:15,174 INFO [train.py:996] (0/4) Epoch 9, batch 5900, loss[loss=0.2001, simple_loss=0.28, pruned_loss=0.06013, over 21882.00 frames. ], tot_loss[loss=0.1994, simple_loss=0.2884, pruned_loss=0.05513, over 4288072.35 frames. ], batch size: 316, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 05:18:00,855 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.02 vs. limit=15.0 2023-06-26 05:18:33,324 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1499322.0, ans=0.0 2023-06-26 05:18:47,510 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1499382.0, ans=0.2 2023-06-26 05:18:54,286 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1499382.0, ans=0.125 2023-06-26 05:19:01,046 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1499382.0, ans=0.125 2023-06-26 05:19:04,244 INFO [train.py:996] (0/4) Epoch 9, batch 5950, loss[loss=0.1589, simple_loss=0.2501, pruned_loss=0.03386, over 21303.00 frames. ], tot_loss[loss=0.201, simple_loss=0.2876, pruned_loss=0.05719, over 4290041.72 frames. ], batch size: 176, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 05:19:08,757 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.99 vs. limit=12.0 2023-06-26 05:19:47,179 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.880e+02 4.429e+02 6.642e+02 9.511e+02 2.071e+03, threshold=1.328e+03, percent-clipped=8.0 2023-06-26 05:19:48,180 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1499562.0, ans=0.125 2023-06-26 05:20:00,567 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.18 vs. 
limit=6.0 2023-06-26 05:20:12,321 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1499622.0, ans=0.1 2023-06-26 05:20:17,339 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1499622.0, ans=0.125 2023-06-26 05:20:50,664 INFO [train.py:996] (0/4) Epoch 9, batch 6000, loss[loss=0.176, simple_loss=0.2459, pruned_loss=0.05308, over 21493.00 frames. ], tot_loss[loss=0.2016, simple_loss=0.2833, pruned_loss=0.05999, over 4286503.26 frames. ], batch size: 212, lr: 3.35e-03, grad_scale: 32.0 2023-06-26 05:20:50,665 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-26 05:21:11,488 INFO [train.py:1028] (0/4) Epoch 9, validation: loss=0.2616, simple_loss=0.3531, pruned_loss=0.08508, over 1796401.00 frames. 2023-06-26 05:21:11,490 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23714MB 2023-06-26 05:21:35,193 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.88 vs. limit=10.0 2023-06-26 05:22:22,107 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1499922.0, ans=0.125 2023-06-26 05:23:08,657 INFO [train.py:996] (0/4) Epoch 9, batch 6050, loss[loss=0.1923, simple_loss=0.2616, pruned_loss=0.06146, over 21597.00 frames. ], tot_loss[loss=0.2007, simple_loss=0.2788, pruned_loss=0.06127, over 4271030.88 frames. ], batch size: 415, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 05:23:48,262 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.995e+02 4.915e+02 7.181e+02 1.064e+03 2.049e+03, threshold=1.436e+03, percent-clipped=12.0 2023-06-26 05:24:10,158 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1500222.0, ans=0.125 2023-06-26 05:24:27,932 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.54 vs. limit=15.0 2023-06-26 05:24:55,889 INFO [train.py:996] (0/4) Epoch 9, batch 6100, loss[loss=0.2013, simple_loss=0.2786, pruned_loss=0.06199, over 21462.00 frames. ], tot_loss[loss=0.2005, simple_loss=0.2789, pruned_loss=0.06101, over 4277353.59 frames. ], batch size: 194, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 05:25:01,746 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1500342.0, ans=0.125 2023-06-26 05:25:26,637 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1500402.0, ans=0.04949747468305833 2023-06-26 05:25:38,591 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1500462.0, ans=0.0 2023-06-26 05:26:06,781 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.23 vs. limit=22.5 2023-06-26 05:26:15,296 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.84 vs. limit=15.0 2023-06-26 05:26:43,545 INFO [train.py:996] (0/4) Epoch 9, batch 6150, loss[loss=0.2024, simple_loss=0.2722, pruned_loss=0.06631, over 15977.00 frames. 
], tot_loss[loss=0.2036, simple_loss=0.2811, pruned_loss=0.06307, over 4276671.16 frames. ], batch size: 62, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 05:26:51,141 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1500642.0, ans=0.1 2023-06-26 05:27:23,507 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.389e+02 4.733e+02 6.899e+02 9.489e+02 3.075e+03, threshold=1.380e+03, percent-clipped=10.0 2023-06-26 05:28:32,086 INFO [train.py:996] (0/4) Epoch 9, batch 6200, loss[loss=0.216, simple_loss=0.2939, pruned_loss=0.06904, over 21374.00 frames. ], tot_loss[loss=0.2056, simple_loss=0.2839, pruned_loss=0.06366, over 4280539.82 frames. ], batch size: 176, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 05:28:36,504 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1500942.0, ans=0.1 2023-06-26 05:29:14,056 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1501062.0, ans=10.0 2023-06-26 05:29:28,705 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1501062.0, ans=0.0 2023-06-26 05:29:52,830 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1501182.0, ans=0.125 2023-06-26 05:30:01,741 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1501182.0, ans=0.125 2023-06-26 05:30:01,759 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1501182.0, ans=0.0 2023-06-26 05:30:10,207 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1501182.0, ans=0.0 2023-06-26 05:30:21,388 INFO [train.py:996] (0/4) Epoch 9, batch 6250, loss[loss=0.2674, simple_loss=0.3607, pruned_loss=0.0871, over 21495.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2882, pruned_loss=0.06339, over 4277854.10 frames. ], batch size: 507, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 05:30:30,474 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1501242.0, ans=0.1 2023-06-26 05:30:58,273 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1501302.0, ans=0.125 2023-06-26 05:31:01,154 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.531e+02 5.718e+02 9.151e+02 1.565e+03 3.193e+03, threshold=1.830e+03, percent-clipped=32.0 2023-06-26 05:31:01,856 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1501362.0, ans=0.125 2023-06-26 05:32:01,929 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1501482.0, ans=0.2 2023-06-26 05:32:07,225 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1501482.0, ans=0.09899494936611666 2023-06-26 05:32:09,868 INFO [train.py:996] (0/4) Epoch 9, batch 6300, loss[loss=0.2216, simple_loss=0.2894, pruned_loss=0.07688, over 21461.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2907, pruned_loss=0.06217, over 4280284.82 frames. 
], batch size: 194, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 05:32:10,961 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.54 vs. limit=6.0 2023-06-26 05:32:12,011 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1501542.0, ans=0.0 2023-06-26 05:33:46,195 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1501782.0, ans=0.125 2023-06-26 05:34:00,237 INFO [train.py:996] (0/4) Epoch 9, batch 6350, loss[loss=0.2628, simple_loss=0.3406, pruned_loss=0.09244, over 21826.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2957, pruned_loss=0.0667, over 4286878.77 frames. ], batch size: 118, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 05:34:13,336 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1501842.0, ans=0.125 2023-06-26 05:34:16,724 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1501842.0, ans=0.0 2023-06-26 05:34:23,953 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.85 vs. limit=15.0 2023-06-26 05:34:34,636 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1501902.0, ans=0.125 2023-06-26 05:34:40,056 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1501902.0, ans=0.125 2023-06-26 05:34:42,621 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.38 vs. limit=15.0 2023-06-26 05:34:52,169 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.804e+02 5.467e+02 7.732e+02 1.098e+03 2.787e+03, threshold=1.546e+03, percent-clipped=5.0 2023-06-26 05:34:59,695 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1501962.0, ans=0.0 2023-06-26 05:35:01,431 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1501962.0, ans=0.07 2023-06-26 05:35:55,802 INFO [train.py:996] (0/4) Epoch 9, batch 6400, loss[loss=0.2342, simple_loss=0.3113, pruned_loss=0.07853, over 21374.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.302, pruned_loss=0.07038, over 4289947.28 frames. ], batch size: 548, lr: 3.34e-03, grad_scale: 32.0 2023-06-26 05:36:11,902 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.41 vs. limit=15.0 2023-06-26 05:36:12,997 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1502142.0, ans=0.125 2023-06-26 05:36:55,979 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1502262.0, ans=0.0 2023-06-26 05:37:45,731 INFO [train.py:996] (0/4) Epoch 9, batch 6450, loss[loss=0.183, simple_loss=0.2615, pruned_loss=0.05222, over 21658.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.3042, pruned_loss=0.07047, over 4288163.59 frames. 
], batch size: 247, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 05:37:46,879 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.19 vs. limit=15.0 2023-06-26 05:38:00,653 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1502442.0, ans=0.125 2023-06-26 05:38:09,253 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1502502.0, ans=0.09899494936611666 2023-06-26 05:38:33,805 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.613e+02 5.276e+02 6.947e+02 1.153e+03 2.587e+03, threshold=1.389e+03, percent-clipped=9.0 2023-06-26 05:38:49,708 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1502562.0, ans=0.0 2023-06-26 05:38:53,453 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1502622.0, ans=0.125 2023-06-26 05:39:35,587 INFO [train.py:996] (0/4) Epoch 9, batch 6500, loss[loss=0.1834, simple_loss=0.2472, pruned_loss=0.05985, over 21302.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2992, pruned_loss=0.06934, over 4276245.82 frames. ], batch size: 177, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 05:40:01,230 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1502802.0, ans=0.5 2023-06-26 05:40:34,185 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1502862.0, ans=0.1 2023-06-26 05:40:52,854 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1502922.0, ans=0.125 2023-06-26 05:40:55,990 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1502922.0, ans=0.125 2023-06-26 05:40:56,124 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1502922.0, ans=0.125 2023-06-26 05:41:30,769 INFO [train.py:996] (0/4) Epoch 9, batch 6550, loss[loss=0.2462, simple_loss=0.3174, pruned_loss=0.0875, over 21754.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2979, pruned_loss=0.06881, over 4282065.47 frames. ], batch size: 441, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 05:42:10,307 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.51 vs. limit=15.0 2023-06-26 05:42:19,571 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.317e+02 4.890e+02 6.578e+02 1.052e+03 2.225e+03, threshold=1.316e+03, percent-clipped=12.0 2023-06-26 05:43:12,667 INFO [train.py:996] (0/4) Epoch 9, batch 6600, loss[loss=0.2243, simple_loss=0.2741, pruned_loss=0.08721, over 21402.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2929, pruned_loss=0.06852, over 4272803.95 frames. ], batch size: 509, lr: 3.34e-03, grad_scale: 8.0 2023-06-26 05:43:31,194 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.43 vs. 
limit=15.0 2023-06-26 05:43:43,199 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1503402.0, ans=0.125 2023-06-26 05:43:48,136 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1503402.0, ans=0.125 2023-06-26 05:44:08,025 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1503462.0, ans=0.125 2023-06-26 05:44:10,079 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=1503462.0, ans=10.0 2023-06-26 05:44:12,610 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.72 vs. limit=6.0 2023-06-26 05:44:33,851 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1503522.0, ans=0.1 2023-06-26 05:45:04,871 INFO [train.py:996] (0/4) Epoch 9, batch 6650, loss[loss=0.1673, simple_loss=0.2519, pruned_loss=0.04131, over 21782.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.2858, pruned_loss=0.06536, over 4262144.56 frames. ], batch size: 352, lr: 3.34e-03, grad_scale: 8.0 2023-06-26 05:45:07,186 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1503642.0, ans=0.0 2023-06-26 05:45:18,362 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.87 vs. limit=15.0 2023-06-26 05:45:53,450 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.932e+02 4.741e+02 6.331e+02 9.151e+02 2.148e+03, threshold=1.266e+03, percent-clipped=9.0 2023-06-26 05:45:57,101 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.47 vs. limit=15.0 2023-06-26 05:46:25,127 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1503822.0, ans=0.0 2023-06-26 05:46:54,058 INFO [train.py:996] (0/4) Epoch 9, batch 6700, loss[loss=0.2166, simple_loss=0.3346, pruned_loss=0.0493, over 19780.00 frames. ], tot_loss[loss=0.2053, simple_loss=0.2823, pruned_loss=0.06414, over 4260988.10 frames. ], batch size: 702, lr: 3.34e-03, grad_scale: 8.0 2023-06-26 05:47:32,788 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1504062.0, ans=0.0 2023-06-26 05:47:46,026 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.10 vs. limit=15.0 2023-06-26 05:48:19,519 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1504182.0, ans=0.0 2023-06-26 05:48:21,387 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1504182.0, ans=0.2 2023-06-26 05:48:33,999 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.19 vs. limit=15.0 2023-06-26 05:48:36,342 INFO [train.py:996] (0/4) Epoch 9, batch 6750, loss[loss=0.2076, simple_loss=0.2858, pruned_loss=0.06471, over 21774.00 frames. 
], tot_loss[loss=0.205, simple_loss=0.2803, pruned_loss=0.06486, over 4267250.56 frames. ], batch size: 112, lr: 3.34e-03, grad_scale: 8.0 2023-06-26 05:48:52,244 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=5.84 vs. limit=22.5 2023-06-26 05:48:59,405 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.52 vs. limit=15.0 2023-06-26 05:49:31,088 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.439e+02 4.588e+02 6.610e+02 8.394e+02 1.640e+03, threshold=1.322e+03, percent-clipped=2.0 2023-06-26 05:50:22,028 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1504482.0, ans=0.0 2023-06-26 05:50:29,558 INFO [train.py:996] (0/4) Epoch 9, batch 6800, loss[loss=0.2412, simple_loss=0.2863, pruned_loss=0.09807, over 21434.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.285, pruned_loss=0.06724, over 4270457.45 frames. ], batch size: 508, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 05:51:27,400 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.16 vs. limit=15.0 2023-06-26 05:51:58,627 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.55 vs. limit=6.0 2023-06-26 05:52:16,606 INFO [train.py:996] (0/4) Epoch 9, batch 6850, loss[loss=0.1862, simple_loss=0.2563, pruned_loss=0.05807, over 21745.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2826, pruned_loss=0.06756, over 4270587.72 frames. ], batch size: 351, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 05:52:18,833 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1504842.0, ans=0.0 2023-06-26 05:52:20,520 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1504842.0, ans=10.0 2023-06-26 05:52:21,306 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.04 vs. limit=22.5 2023-06-26 05:52:38,297 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1504902.0, ans=0.0 2023-06-26 05:52:54,504 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.97 vs. limit=15.0 2023-06-26 05:53:05,512 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.496e+02 4.685e+02 7.280e+02 1.216e+03 2.418e+03, threshold=1.456e+03, percent-clipped=17.0 2023-06-26 05:53:39,286 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1505022.0, ans=0.125 2023-06-26 05:53:45,351 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.02 vs. limit=6.0 2023-06-26 05:54:05,603 INFO [train.py:996] (0/4) Epoch 9, batch 6900, loss[loss=0.224, simple_loss=0.2921, pruned_loss=0.07797, over 21853.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2816, pruned_loss=0.06698, over 4281341.46 frames. 
], batch size: 351, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 05:54:16,003 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.whiten.whitening_limit, batch_count=1505142.0, ans=12.0 2023-06-26 05:54:17,009 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1505142.0, ans=0.0 2023-06-26 05:54:50,115 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1505262.0, ans=0.0 2023-06-26 05:55:54,118 INFO [train.py:996] (0/4) Epoch 9, batch 6950, loss[loss=0.1718, simple_loss=0.2574, pruned_loss=0.04305, over 21282.00 frames. ], tot_loss[loss=0.2056, simple_loss=0.2819, pruned_loss=0.06465, over 4278493.71 frames. ], batch size: 176, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 05:55:54,887 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1505442.0, ans=0.0 2023-06-26 05:56:01,994 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1505442.0, ans=0.125 2023-06-26 05:56:30,098 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1505502.0, ans=0.0 2023-06-26 05:56:43,261 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.242e+02 5.032e+02 6.537e+02 9.718e+02 2.265e+03, threshold=1.307e+03, percent-clipped=8.0 2023-06-26 05:57:06,271 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1505622.0, ans=0.05 2023-06-26 05:57:13,253 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1505622.0, ans=0.125 2023-06-26 05:57:38,329 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1505682.0, ans=0.125 2023-06-26 05:57:42,919 INFO [train.py:996] (0/4) Epoch 9, batch 7000, loss[loss=0.1843, simple_loss=0.2535, pruned_loss=0.05752, over 21557.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.2841, pruned_loss=0.06657, over 4280326.92 frames. ], batch size: 263, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 05:57:54,695 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1505742.0, ans=0.125 2023-06-26 05:59:38,677 INFO [train.py:996] (0/4) Epoch 9, batch 7050, loss[loss=0.2046, simple_loss=0.2774, pruned_loss=0.06583, over 21829.00 frames. ], tot_loss[loss=0.206, simple_loss=0.2817, pruned_loss=0.06517, over 4269554.90 frames. 
], batch size: 118, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 06:00:27,419 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.188e+02 4.829e+02 6.611e+02 8.594e+02 1.864e+03, threshold=1.322e+03, percent-clipped=11.0 2023-06-26 06:01:08,946 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1506282.0, ans=0.025 2023-06-26 06:01:20,749 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1506282.0, ans=0.0 2023-06-26 06:01:26,456 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1506342.0, ans=0.125 2023-06-26 06:01:33,082 INFO [train.py:996] (0/4) Epoch 9, batch 7100, loss[loss=0.2014, simple_loss=0.2833, pruned_loss=0.05975, over 21746.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2863, pruned_loss=0.06692, over 4276454.78 frames. ], batch size: 247, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 06:01:33,586 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 06:02:12,095 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1506462.0, ans=0.025 2023-06-26 06:02:33,559 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1506522.0, ans=0.125 2023-06-26 06:02:44,192 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1506522.0, ans=0.1 2023-06-26 06:03:00,842 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.40 vs. limit=15.0 2023-06-26 06:03:22,335 INFO [train.py:996] (0/4) Epoch 9, batch 7150, loss[loss=0.3131, simple_loss=0.3641, pruned_loss=0.131, over 21345.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2845, pruned_loss=0.0652, over 4269121.55 frames. ], batch size: 507, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 06:04:06,078 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.994e+02 4.588e+02 6.424e+02 8.469e+02 2.110e+03, threshold=1.285e+03, percent-clipped=2.0 2023-06-26 06:04:15,762 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff2.min_abs, batch_count=1506762.0, ans=0.1 2023-06-26 06:04:26,327 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1506822.0, ans=0.2 2023-06-26 06:04:33,888 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.20 vs. limit=22.5 2023-06-26 06:04:49,522 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1506882.0, ans=0.125 2023-06-26 06:05:04,100 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1506882.0, ans=0.125 2023-06-26 06:05:11,707 INFO [train.py:996] (0/4) Epoch 9, batch 7200, loss[loss=0.2047, simple_loss=0.2679, pruned_loss=0.07069, over 21567.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.287, pruned_loss=0.06726, over 4272424.61 frames. 
], batch size: 263, lr: 3.34e-03, grad_scale: 32.0 2023-06-26 06:05:59,062 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1507062.0, ans=0.1 2023-06-26 06:06:42,193 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1507182.0, ans=0.125 2023-06-26 06:07:00,471 INFO [train.py:996] (0/4) Epoch 9, batch 7250, loss[loss=0.193, simple_loss=0.2682, pruned_loss=0.05885, over 21769.00 frames. ], tot_loss[loss=0.2077, simple_loss=0.2815, pruned_loss=0.06694, over 4275067.48 frames. ], batch size: 118, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 06:07:06,157 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1507242.0, ans=0.125 2023-06-26 06:07:25,025 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1507302.0, ans=0.125 2023-06-26 06:07:28,800 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1507302.0, ans=0.125 2023-06-26 06:07:45,479 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.193e+02 5.249e+02 7.377e+02 1.151e+03 2.707e+03, threshold=1.475e+03, percent-clipped=23.0 2023-06-26 06:08:00,618 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1507362.0, ans=0.07 2023-06-26 06:08:19,661 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1507422.0, ans=0.2 2023-06-26 06:08:48,791 INFO [train.py:996] (0/4) Epoch 9, batch 7300, loss[loss=0.1837, simple_loss=0.2478, pruned_loss=0.05979, over 21752.00 frames. ], tot_loss[loss=0.2038, simple_loss=0.2758, pruned_loss=0.0659, over 4268773.88 frames. ], batch size: 300, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 06:08:49,412 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1507542.0, ans=0.125 2023-06-26 06:09:13,692 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1507602.0, ans=0.04949747468305833 2023-06-26 06:09:19,215 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1507602.0, ans=0.0 2023-06-26 06:10:34,343 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1507782.0, ans=0.125 2023-06-26 06:10:44,121 INFO [train.py:996] (0/4) Epoch 9, batch 7350, loss[loss=0.2147, simple_loss=0.2785, pruned_loss=0.07543, over 21855.00 frames. ], tot_loss[loss=0.205, simple_loss=0.2752, pruned_loss=0.06745, over 4269769.64 frames. 
], batch size: 98, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 06:10:47,731 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1507842.0, ans=0.125 2023-06-26 06:11:11,486 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1507902.0, ans=0.2 2023-06-26 06:11:30,328 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.284e+02 4.727e+02 6.627e+02 9.690e+02 1.819e+03, threshold=1.325e+03, percent-clipped=8.0 2023-06-26 06:11:31,622 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.33 vs. limit=22.5 2023-06-26 06:12:34,169 INFO [train.py:996] (0/4) Epoch 9, batch 7400, loss[loss=0.2278, simple_loss=0.325, pruned_loss=0.06536, over 21572.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2809, pruned_loss=0.06851, over 4272675.96 frames. ], batch size: 441, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 06:12:59,451 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1508202.0, ans=0.125 2023-06-26 06:13:00,979 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1508202.0, ans=0.125 2023-06-26 06:13:45,003 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1508322.0, ans=0.1 2023-06-26 06:13:45,005 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1508322.0, ans=0.125 2023-06-26 06:13:45,553 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.32 vs. limit=15.0 2023-06-26 06:14:25,324 INFO [train.py:996] (0/4) Epoch 9, batch 7450, loss[loss=0.2087, simple_loss=0.2955, pruned_loss=0.06091, over 19966.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.28, pruned_loss=0.06761, over 4266315.66 frames. ], batch size: 703, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 06:14:36,547 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1508442.0, ans=0.0 2023-06-26 06:14:40,462 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1508442.0, ans=0.125 2023-06-26 06:15:01,333 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.25 vs. limit=15.0 2023-06-26 06:15:23,589 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.343e+02 4.976e+02 6.577e+02 1.050e+03 2.324e+03, threshold=1.315e+03, percent-clipped=17.0 2023-06-26 06:15:29,166 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.92 vs. limit=10.0 2023-06-26 06:15:43,225 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1508622.0, ans=0.125 2023-06-26 06:16:18,121 INFO [train.py:996] (0/4) Epoch 9, batch 7500, loss[loss=0.2078, simple_loss=0.2775, pruned_loss=0.06902, over 21436.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2867, pruned_loss=0.0686, over 4267146.72 frames. 
], batch size: 211, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 06:16:18,885 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1508742.0, ans=0.0 2023-06-26 06:16:40,316 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1508802.0, ans=0.125 2023-06-26 06:17:42,307 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.17 vs. limit=10.0 2023-06-26 06:17:45,090 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1508922.0, ans=0.125 2023-06-26 06:18:08,920 INFO [train.py:996] (0/4) Epoch 9, batch 7550, loss[loss=0.1779, simple_loss=0.2662, pruned_loss=0.04474, over 21421.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2923, pruned_loss=0.06757, over 4266892.40 frames. ], batch size: 194, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 06:18:09,541 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1509042.0, ans=0.1 2023-06-26 06:18:18,775 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1509042.0, ans=0.2 2023-06-26 06:18:25,456 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1509042.0, ans=0.5 2023-06-26 06:18:58,554 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 06:19:04,854 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.393e+02 6.002e+02 8.588e+02 1.350e+03 2.877e+03, threshold=1.718e+03, percent-clipped=25.0 2023-06-26 06:19:38,754 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1509282.0, ans=0.0 2023-06-26 06:19:51,459 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.64 vs. limit=12.0 2023-06-26 06:19:56,708 INFO [train.py:996] (0/4) Epoch 9, batch 7600, loss[loss=0.2151, simple_loss=0.2762, pruned_loss=0.07701, over 21318.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2906, pruned_loss=0.06621, over 4268326.12 frames. ], batch size: 176, lr: 3.33e-03, grad_scale: 32.0 2023-06-26 06:20:17,337 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1509342.0, ans=0.125 2023-06-26 06:20:24,146 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1509402.0, ans=0.07 2023-06-26 06:20:24,219 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1509402.0, ans=0.0 2023-06-26 06:21:02,781 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1509462.0, ans=0.125 2023-06-26 06:21:37,845 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1509582.0, ans=0.125 2023-06-26 06:21:46,196 INFO [train.py:996] (0/4) Epoch 9, batch 7650, loss[loss=0.2524, simple_loss=0.3004, pruned_loss=0.1022, over 21811.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2899, pruned_loss=0.06777, over 4272727.74 frames. 
], batch size: 508, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 06:21:53,882 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1509642.0, ans=0.1 2023-06-26 06:22:43,951 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.446e+02 5.104e+02 7.952e+02 1.146e+03 1.972e+03, threshold=1.590e+03, percent-clipped=6.0 2023-06-26 06:23:26,976 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1509882.0, ans=0.2 2023-06-26 06:23:40,362 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1509942.0, ans=0.125 2023-06-26 06:23:41,273 INFO [train.py:996] (0/4) Epoch 9, batch 7700, loss[loss=0.2444, simple_loss=0.3207, pruned_loss=0.08405, over 21478.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2933, pruned_loss=0.07075, over 4280568.19 frames. ], batch size: 194, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 06:24:17,895 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.09 vs. limit=10.0 2023-06-26 06:24:28,256 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1510062.0, ans=0.0 2023-06-26 06:24:53,796 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1510122.0, ans=0.0 2023-06-26 06:25:07,501 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.59 vs. limit=10.0 2023-06-26 06:25:33,219 INFO [train.py:996] (0/4) Epoch 9, batch 7750, loss[loss=0.1889, simple_loss=0.2431, pruned_loss=0.0673, over 20894.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.3005, pruned_loss=0.07221, over 4276752.76 frames. ], batch size: 608, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 06:25:45,904 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.69 vs. limit=22.5 2023-06-26 06:26:04,489 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1510302.0, ans=0.125 2023-06-26 06:26:09,429 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=1510302.0, ans=10.0 2023-06-26 06:26:32,282 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.490e+02 5.408e+02 8.578e+02 1.362e+03 2.742e+03, threshold=1.716e+03, percent-clipped=14.0 2023-06-26 06:27:08,192 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1510482.0, ans=0.1 2023-06-26 06:27:17,447 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1510482.0, ans=0.0 2023-06-26 06:27:34,378 INFO [train.py:996] (0/4) Epoch 9, batch 7800, loss[loss=0.1815, simple_loss=0.2268, pruned_loss=0.06805, over 21859.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.3006, pruned_loss=0.07233, over 4277399.18 frames. 
], batch size: 98, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 06:28:22,241 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1510662.0, ans=0.125 2023-06-26 06:29:23,916 INFO [train.py:996] (0/4) Epoch 9, batch 7850, loss[loss=0.2043, simple_loss=0.2684, pruned_loss=0.07008, over 21228.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2923, pruned_loss=0.07125, over 4257784.11 frames. ], batch size: 177, lr: 3.33e-03, grad_scale: 8.0 2023-06-26 06:29:58,204 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.79 vs. limit=12.0 2023-06-26 06:29:59,433 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1510902.0, ans=0.0 2023-06-26 06:30:12,951 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.204e+02 4.902e+02 7.462e+02 1.114e+03 2.139e+03, threshold=1.492e+03, percent-clipped=5.0 2023-06-26 06:31:15,010 INFO [train.py:996] (0/4) Epoch 9, batch 7900, loss[loss=0.1864, simple_loss=0.2492, pruned_loss=0.06175, over 21428.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.29, pruned_loss=0.07095, over 4262605.90 frames. ], batch size: 212, lr: 3.33e-03, grad_scale: 8.0 2023-06-26 06:31:21,838 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.02 vs. limit=15.0 2023-06-26 06:32:24,673 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1511322.0, ans=0.1 2023-06-26 06:32:24,835 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1511322.0, ans=0.04949747468305833 2023-06-26 06:32:28,540 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1511322.0, ans=0.2 2023-06-26 06:33:07,278 INFO [train.py:996] (0/4) Epoch 9, batch 7950, loss[loss=0.2605, simple_loss=0.3408, pruned_loss=0.09011, over 21611.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2924, pruned_loss=0.07024, over 4260675.49 frames. ], batch size: 507, lr: 3.33e-03, grad_scale: 8.0 2023-06-26 06:34:02,993 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.829e+02 6.422e+02 9.281e+02 1.330e+03 3.368e+03, threshold=1.856e+03, percent-clipped=18.0 2023-06-26 06:34:44,201 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 06:34:53,391 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1511682.0, ans=0.125 2023-06-26 06:35:05,372 INFO [train.py:996] (0/4) Epoch 9, batch 8000, loss[loss=0.216, simple_loss=0.3025, pruned_loss=0.06475, over 21783.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.296, pruned_loss=0.0706, over 4253349.98 frames. 
], batch size: 282, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 06:36:10,559 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1511862.0, ans=0.125 2023-06-26 06:36:12,583 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1511862.0, ans=0.2 2023-06-26 06:36:46,193 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-252000.pt 2023-06-26 06:37:01,455 INFO [train.py:996] (0/4) Epoch 9, batch 8050, loss[loss=0.2075, simple_loss=0.2744, pruned_loss=0.07027, over 21449.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.3001, pruned_loss=0.07127, over 4252271.48 frames. ], batch size: 194, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 06:38:01,847 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.472e+02 6.276e+02 8.546e+02 1.348e+03 3.651e+03, threshold=1.709e+03, percent-clipped=15.0 2023-06-26 06:38:32,940 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1512282.0, ans=0.1 2023-06-26 06:38:45,470 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.14 vs. limit=15.0 2023-06-26 06:38:51,623 INFO [train.py:996] (0/4) Epoch 9, batch 8100, loss[loss=0.2079, simple_loss=0.2909, pruned_loss=0.06245, over 21557.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2978, pruned_loss=0.072, over 4261341.14 frames. ], batch size: 131, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 06:40:19,817 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 06:40:58,183 INFO [train.py:996] (0/4) Epoch 9, batch 8150, loss[loss=0.1898, simple_loss=0.2637, pruned_loss=0.058, over 21247.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.3053, pruned_loss=0.07347, over 4262822.86 frames. ], batch size: 159, lr: 3.33e-03, grad_scale: 8.0 2023-06-26 06:41:11,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1512642.0, ans=0.125 2023-06-26 06:41:24,872 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.65 vs. limit=22.5 2023-06-26 06:41:54,139 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.538e+02 6.819e+02 1.034e+03 1.568e+03 4.387e+03, threshold=2.069e+03, percent-clipped=18.0 2023-06-26 06:41:57,499 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.65 vs. limit=15.0 2023-06-26 06:42:00,879 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.88 vs. limit=22.5 2023-06-26 06:42:49,130 INFO [train.py:996] (0/4) Epoch 9, batch 8200, loss[loss=0.1965, simple_loss=0.2588, pruned_loss=0.06709, over 21321.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2978, pruned_loss=0.07117, over 4249470.76 frames. ], batch size: 551, lr: 3.33e-03, grad_scale: 8.0 2023-06-26 06:43:04,343 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.43 vs. 
limit=15.0 2023-06-26 06:43:19,730 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1513002.0, ans=0.1 2023-06-26 06:43:43,181 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.15 vs. limit=12.0 2023-06-26 06:43:54,095 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.53 vs. limit=15.0 2023-06-26 06:43:59,985 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.99 vs. limit=6.0 2023-06-26 06:44:12,269 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.21 vs. limit=22.5 2023-06-26 06:44:40,643 INFO [train.py:996] (0/4) Epoch 9, batch 8250, loss[loss=0.254, simple_loss=0.3471, pruned_loss=0.08039, over 21619.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2986, pruned_loss=0.07252, over 4251921.83 frames. ], batch size: 441, lr: 3.33e-03, grad_scale: 8.0 2023-06-26 06:44:53,487 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1513242.0, ans=0.0 2023-06-26 06:45:07,828 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1513302.0, ans=0.0 2023-06-26 06:45:19,937 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1513302.0, ans=0.0 2023-06-26 06:45:36,648 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.396e+02 4.867e+02 7.289e+02 1.042e+03 1.970e+03, threshold=1.458e+03, percent-clipped=0.0 2023-06-26 06:45:52,831 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1513422.0, ans=0.1 2023-06-26 06:46:35,328 INFO [train.py:996] (0/4) Epoch 9, batch 8300, loss[loss=0.1882, simple_loss=0.2691, pruned_loss=0.05365, over 21438.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2964, pruned_loss=0.07016, over 4255615.66 frames. ], batch size: 195, lr: 3.33e-03, grad_scale: 8.0 2023-06-26 06:46:37,498 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1513542.0, ans=0.2 2023-06-26 06:46:49,706 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1513542.0, ans=0.125 2023-06-26 06:47:14,547 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.49 vs. limit=15.0 2023-06-26 06:47:21,336 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1513662.0, ans=0.0 2023-06-26 06:48:25,380 INFO [train.py:996] (0/4) Epoch 9, batch 8350, loss[loss=0.1994, simple_loss=0.2806, pruned_loss=0.05905, over 21801.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2955, pruned_loss=0.06899, over 4252333.87 frames. 
], batch size: 317, lr: 3.33e-03, grad_scale: 8.0 2023-06-26 06:48:42,188 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1513902.0, ans=0.05 2023-06-26 06:48:48,862 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1513902.0, ans=0.2 2023-06-26 06:49:22,436 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.156e+02 5.177e+02 7.489e+02 1.153e+03 2.858e+03, threshold=1.498e+03, percent-clipped=11.0 2023-06-26 06:49:31,839 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 06:50:01,306 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1514082.0, ans=0.0 2023-06-26 06:50:01,415 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1514082.0, ans=0.0 2023-06-26 06:50:01,984 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.96 vs. limit=6.0 2023-06-26 06:50:14,403 INFO [train.py:996] (0/4) Epoch 9, batch 8400, loss[loss=0.1912, simple_loss=0.2714, pruned_loss=0.05549, over 21276.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2919, pruned_loss=0.06654, over 4253834.56 frames. ], batch size: 176, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 06:50:23,699 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1514142.0, ans=0.125 2023-06-26 06:50:39,690 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.88 vs. limit=15.0 2023-06-26 06:52:01,966 INFO [train.py:996] (0/4) Epoch 9, batch 8450, loss[loss=0.2095, simple_loss=0.2828, pruned_loss=0.06809, over 21862.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2913, pruned_loss=0.06606, over 4260954.47 frames. ], batch size: 351, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 06:52:07,336 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1514442.0, ans=0.125 2023-06-26 06:52:58,303 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.897e+02 4.182e+02 5.654e+02 7.712e+02 3.428e+03, threshold=1.131e+03, percent-clipped=11.0 2023-06-26 06:53:48,952 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 06:53:51,631 INFO [train.py:996] (0/4) Epoch 9, batch 8500, loss[loss=0.1925, simple_loss=0.2436, pruned_loss=0.07069, over 21255.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.289, pruned_loss=0.06695, over 4246792.36 frames. 
], batch size: 548, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 06:54:15,280 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1514802.0, ans=0.125 2023-06-26 06:54:49,642 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1514862.0, ans=0.2 2023-06-26 06:55:01,023 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1514922.0, ans=0.5 2023-06-26 06:55:04,702 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1514922.0, ans=0.025 2023-06-26 06:55:42,992 INFO [train.py:996] (0/4) Epoch 9, batch 8550, loss[loss=0.2403, simple_loss=0.3338, pruned_loss=0.0734, over 21830.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2933, pruned_loss=0.0692, over 4257629.17 frames. ], batch size: 316, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 06:56:28,979 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1515162.0, ans=0.125 2023-06-26 06:56:40,479 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.388e+02 5.673e+02 9.028e+02 1.285e+03 2.973e+03, threshold=1.806e+03, percent-clipped=33.0 2023-06-26 06:56:43,525 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1515162.0, ans=0.125 2023-06-26 06:57:06,188 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1515222.0, ans=0.125 2023-06-26 06:57:10,249 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1515222.0, ans=0.125 2023-06-26 06:57:12,210 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1515222.0, ans=0.035 2023-06-26 06:57:14,472 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.19 vs. limit=6.0 2023-06-26 06:57:34,133 INFO [train.py:996] (0/4) Epoch 9, batch 8600, loss[loss=0.2387, simple_loss=0.3227, pruned_loss=0.07739, over 21416.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.3008, pruned_loss=0.0717, over 4268687.31 frames. ], batch size: 131, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 06:57:44,558 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.72 vs. limit=15.0 2023-06-26 06:59:25,172 INFO [train.py:996] (0/4) Epoch 9, batch 8650, loss[loss=0.1764, simple_loss=0.2596, pruned_loss=0.04657, over 21162.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.3067, pruned_loss=0.07175, over 4269635.27 frames. 
], batch size: 143, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 06:59:48,141 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1515702.0, ans=0.125 2023-06-26 07:00:06,832 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1515762.0, ans=0.1 2023-06-26 07:00:25,178 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.107e+02 4.849e+02 6.283e+02 8.957e+02 2.012e+03, threshold=1.257e+03, percent-clipped=3.0 2023-06-26 07:01:11,961 INFO [train.py:996] (0/4) Epoch 9, batch 8700, loss[loss=0.1757, simple_loss=0.2483, pruned_loss=0.05158, over 21311.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2985, pruned_loss=0.06814, over 4265122.52 frames. ], batch size: 160, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 07:01:51,566 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.40 vs. limit=15.0 2023-06-26 07:02:19,044 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1516122.0, ans=0.125 2023-06-26 07:02:48,397 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1516182.0, ans=0.125 2023-06-26 07:02:54,667 INFO [train.py:996] (0/4) Epoch 9, batch 8750, loss[loss=0.2007, simple_loss=0.2782, pruned_loss=0.06161, over 21379.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2936, pruned_loss=0.06844, over 4272555.82 frames. ], batch size: 144, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 07:03:45,906 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1516362.0, ans=0.2 2023-06-26 07:04:01,637 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1516362.0, ans=0.2 2023-06-26 07:04:02,674 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.478e+02 4.871e+02 5.858e+02 9.020e+02 2.163e+03, threshold=1.172e+03, percent-clipped=9.0 2023-06-26 07:04:51,207 INFO [train.py:996] (0/4) Epoch 9, batch 8800, loss[loss=0.2793, simple_loss=0.3565, pruned_loss=0.1011, over 21759.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.3017, pruned_loss=0.07088, over 4274522.70 frames. ], batch size: 441, lr: 3.33e-03, grad_scale: 32.0 2023-06-26 07:05:45,180 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1516662.0, ans=0.0 2023-06-26 07:06:12,127 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1516722.0, ans=0.0 2023-06-26 07:06:37,864 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1516782.0, ans=0.125 2023-06-26 07:06:46,518 INFO [train.py:996] (0/4) Epoch 9, batch 8850, loss[loss=0.2132, simple_loss=0.3046, pruned_loss=0.06089, over 21298.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3097, pruned_loss=0.07398, over 4270583.42 frames. 
], batch size: 548, lr: 3.33e-03, grad_scale: 32.0 2023-06-26 07:07:18,492 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1516902.0, ans=0.125 2023-06-26 07:07:28,357 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.62 vs. limit=15.0 2023-06-26 07:07:43,093 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.458e+02 5.018e+02 7.490e+02 1.008e+03 2.036e+03, threshold=1.498e+03, percent-clipped=19.0 2023-06-26 07:07:51,142 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.04 vs. limit=15.0 2023-06-26 07:08:37,041 INFO [train.py:996] (0/4) Epoch 9, batch 8900, loss[loss=0.2439, simple_loss=0.3357, pruned_loss=0.07605, over 21433.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.305, pruned_loss=0.0723, over 4256123.56 frames. ], batch size: 471, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 07:09:08,858 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1517202.0, ans=0.2 2023-06-26 07:09:21,687 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1517202.0, ans=0.0 2023-06-26 07:09:41,868 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1517322.0, ans=0.125 2023-06-26 07:10:02,479 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1517322.0, ans=0.125 2023-06-26 07:10:15,270 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1517382.0, ans=0.125 2023-06-26 07:10:34,129 INFO [train.py:996] (0/4) Epoch 9, batch 8950, loss[loss=0.1769, simple_loss=0.2437, pruned_loss=0.05504, over 21258.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.3041, pruned_loss=0.07166, over 4255031.77 frames. ], batch size: 176, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 07:11:31,721 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.717e+02 6.385e+02 1.007e+03 1.831e+03 3.231e+03, threshold=2.014e+03, percent-clipped=34.0 2023-06-26 07:11:39,190 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1517622.0, ans=0.125 2023-06-26 07:12:29,531 INFO [train.py:996] (0/4) Epoch 9, batch 9000, loss[loss=0.1821, simple_loss=0.2443, pruned_loss=0.05998, over 21352.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2973, pruned_loss=0.0706, over 4251803.98 frames. ], batch size: 211, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 07:12:29,532 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-26 07:12:47,773 INFO [train.py:1028] (0/4) Epoch 9, validation: loss=0.2687, simple_loss=0.357, pruned_loss=0.09027, over 1796401.00 frames. 2023-06-26 07:12:47,774 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23714MB 2023-06-26 07:13:28,373 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.04 vs. 
limit=8.0 2023-06-26 07:13:32,826 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1517862.0, ans=0.05 2023-06-26 07:13:47,561 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1517862.0, ans=0.95 2023-06-26 07:14:16,350 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1517922.0, ans=0.125 2023-06-26 07:14:38,774 INFO [train.py:996] (0/4) Epoch 9, batch 9050, loss[loss=0.2672, simple_loss=0.3494, pruned_loss=0.09248, over 21830.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2947, pruned_loss=0.06759, over 4252054.64 frames. ], batch size: 118, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 07:14:54,353 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.10 vs. limit=22.5 2023-06-26 07:15:17,451 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1518102.0, ans=0.125 2023-06-26 07:15:38,265 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.242e+02 4.774e+02 6.783e+02 1.195e+03 2.023e+03, threshold=1.357e+03, percent-clipped=1.0 2023-06-26 07:16:03,689 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1518222.0, ans=0.1 2023-06-26 07:16:30,166 INFO [train.py:996] (0/4) Epoch 9, batch 9100, loss[loss=0.2048, simple_loss=0.2953, pruned_loss=0.05711, over 21745.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2985, pruned_loss=0.06895, over 4256393.31 frames. ], batch size: 298, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:17:19,497 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1518462.0, ans=0.125 2023-06-26 07:17:31,448 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1518462.0, ans=0.125 2023-06-26 07:18:02,165 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1518582.0, ans=0.125 2023-06-26 07:18:02,234 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1518582.0, ans=0.125 2023-06-26 07:18:19,545 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1518642.0, ans=0.125 2023-06-26 07:18:20,710 INFO [train.py:996] (0/4) Epoch 9, batch 9150, loss[loss=0.2386, simple_loss=0.3277, pruned_loss=0.07476, over 21758.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.3008, pruned_loss=0.06736, over 4258282.75 frames. 
], batch size: 332, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:18:38,322 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1518642.0, ans=0.0 2023-06-26 07:19:05,839 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1518762.0, ans=0.0 2023-06-26 07:19:10,825 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 07:19:29,344 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.947e+02 4.682e+02 7.293e+02 9.875e+02 2.025e+03, threshold=1.459e+03, percent-clipped=11.0 2023-06-26 07:20:13,556 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1518942.0, ans=0.0 2023-06-26 07:20:14,568 INFO [train.py:996] (0/4) Epoch 9, batch 9200, loss[loss=0.285, simple_loss=0.3537, pruned_loss=0.1081, over 21453.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.3009, pruned_loss=0.06614, over 4265020.78 frames. ], batch size: 471, lr: 3.32e-03, grad_scale: 32.0 2023-06-26 07:20:31,751 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1519002.0, ans=0.125 2023-06-26 07:21:12,144 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1519062.0, ans=0.0 2023-06-26 07:21:38,561 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.77 vs. limit=15.0 2023-06-26 07:21:56,825 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1519182.0, ans=0.2 2023-06-26 07:22:03,213 INFO [train.py:996] (0/4) Epoch 9, batch 9250, loss[loss=0.1846, simple_loss=0.2653, pruned_loss=0.05197, over 21770.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.3059, pruned_loss=0.0697, over 4261339.02 frames. ], batch size: 102, lr: 3.32e-03, grad_scale: 32.0 2023-06-26 07:22:51,474 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1519302.0, ans=0.2 2023-06-26 07:23:06,661 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.432e+02 5.072e+02 7.125e+02 1.070e+03 2.650e+03, threshold=1.425e+03, percent-clipped=11.0 2023-06-26 07:23:18,333 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1519422.0, ans=0.2 2023-06-26 07:23:53,073 INFO [train.py:996] (0/4) Epoch 9, batch 9300, loss[loss=0.1833, simple_loss=0.2503, pruned_loss=0.05813, over 21194.00 frames. ], tot_loss[loss=0.221, simple_loss=0.3013, pruned_loss=0.07033, over 4251918.30 frames. ], batch size: 176, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:24:51,883 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.79 vs. 
limit=15.0 2023-06-26 07:25:08,747 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1519722.0, ans=10.0 2023-06-26 07:25:39,169 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1519782.0, ans=0.125 2023-06-26 07:25:39,252 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1519782.0, ans=0.2 2023-06-26 07:25:39,825 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=1519782.0, ans=10.0 2023-06-26 07:25:43,696 INFO [train.py:996] (0/4) Epoch 9, batch 9350, loss[loss=0.2542, simple_loss=0.3354, pruned_loss=0.08655, over 21711.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.3074, pruned_loss=0.07149, over 4258282.68 frames. ], batch size: 332, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:26:01,822 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.71 vs. limit=15.0 2023-06-26 07:26:20,992 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1519902.0, ans=0.1 2023-06-26 07:26:31,048 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1519902.0, ans=0.125 2023-06-26 07:26:54,938 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.694e+02 5.143e+02 7.806e+02 1.433e+03 2.856e+03, threshold=1.561e+03, percent-clipped=26.0 2023-06-26 07:27:34,361 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1520082.0, ans=0.125 2023-06-26 07:27:38,718 INFO [train.py:996] (0/4) Epoch 9, batch 9400, loss[loss=0.222, simple_loss=0.2817, pruned_loss=0.08115, over 21498.00 frames. ], tot_loss[loss=0.225, simple_loss=0.3067, pruned_loss=0.07163, over 4263251.37 frames. ], batch size: 441, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:28:41,624 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1520262.0, ans=0.0 2023-06-26 07:28:49,858 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1520322.0, ans=0.0 2023-06-26 07:28:58,556 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1520322.0, ans=0.0 2023-06-26 07:29:31,984 INFO [train.py:996] (0/4) Epoch 9, batch 9450, loss[loss=0.255, simple_loss=0.3431, pruned_loss=0.08343, over 20723.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.2995, pruned_loss=0.07011, over 4256400.04 frames. 
], batch size: 607, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:30:31,553 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.480e+02 5.776e+02 8.947e+02 1.514e+03 4.644e+03, threshold=1.789e+03, percent-clipped=22.0 2023-06-26 07:30:44,950 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1520622.0, ans=0.0 2023-06-26 07:30:55,465 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1520682.0, ans=0.125 2023-06-26 07:31:21,081 INFO [train.py:996] (0/4) Epoch 9, batch 9500, loss[loss=0.2131, simple_loss=0.2896, pruned_loss=0.06826, over 21376.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2922, pruned_loss=0.06801, over 4257985.64 frames. ], batch size: 194, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:32:23,204 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.23 vs. limit=10.0 2023-06-26 07:32:36,787 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1520922.0, ans=0.1 2023-06-26 07:32:39,181 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1520922.0, ans=0.125 2023-06-26 07:33:06,711 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1520982.0, ans=0.0 2023-06-26 07:33:08,535 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1520982.0, ans=0.125 2023-06-26 07:33:12,882 INFO [train.py:996] (0/4) Epoch 9, batch 9550, loss[loss=0.2401, simple_loss=0.3389, pruned_loss=0.07064, over 21801.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2966, pruned_loss=0.07031, over 4264003.65 frames. ], batch size: 282, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:33:45,324 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.10 vs. limit=10.0 2023-06-26 07:34:06,850 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1521162.0, ans=0.0 2023-06-26 07:34:11,612 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.380e+02 4.672e+02 5.675e+02 8.285e+02 1.544e+03, threshold=1.135e+03, percent-clipped=0.0 2023-06-26 07:34:12,196 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1521162.0, ans=0.1 2023-06-26 07:34:25,153 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1521222.0, ans=0.125 2023-06-26 07:35:01,320 INFO [train.py:996] (0/4) Epoch 9, batch 9600, loss[loss=0.2073, simple_loss=0.2847, pruned_loss=0.06495, over 21915.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.2986, pruned_loss=0.07138, over 4270160.22 frames. ], batch size: 316, lr: 3.32e-03, grad_scale: 32.0 2023-06-26 07:35:46,159 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1521462.0, ans=0.0 2023-06-26 07:36:52,869 INFO [train.py:996] (0/4) Epoch 9, batch 9650, loss[loss=0.2483, simple_loss=0.3731, pruned_loss=0.06173, over 20761.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2983, pruned_loss=0.07081, over 4277572.82 frames. 
], batch size: 607, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:37:07,463 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1521642.0, ans=0.125 2023-06-26 07:37:16,997 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=18.67 vs. limit=22.5 2023-06-26 07:37:47,256 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.89 vs. limit=22.5 2023-06-26 07:37:49,610 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.333e+02 4.623e+02 6.972e+02 1.187e+03 2.800e+03, threshold=1.394e+03, percent-clipped=26.0 2023-06-26 07:38:38,186 INFO [train.py:996] (0/4) Epoch 9, batch 9700, loss[loss=0.1994, simple_loss=0.2866, pruned_loss=0.05614, over 16353.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2992, pruned_loss=0.07086, over 4268886.49 frames. ], batch size: 60, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:40:27,226 INFO [train.py:996] (0/4) Epoch 9, batch 9750, loss[loss=0.2365, simple_loss=0.2782, pruned_loss=0.09738, over 21380.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2945, pruned_loss=0.07028, over 4259507.24 frames. ], batch size: 508, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:40:56,718 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1522302.0, ans=0.125 2023-06-26 07:41:22,045 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.554e+02 4.804e+02 6.885e+02 8.968e+02 2.424e+03, threshold=1.377e+03, percent-clipped=5.0 2023-06-26 07:41:37,207 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1522422.0, ans=0.0 2023-06-26 07:42:07,402 INFO [train.py:996] (0/4) Epoch 9, batch 9800, loss[loss=0.2203, simple_loss=0.2937, pruned_loss=0.07343, over 21906.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2959, pruned_loss=0.07092, over 4257237.00 frames. ], batch size: 118, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:42:22,606 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.42 vs. limit=6.0 2023-06-26 07:42:50,402 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1522602.0, ans=0.2 2023-06-26 07:43:09,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1522722.0, ans=0.04949747468305833 2023-06-26 07:43:29,201 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1522722.0, ans=0.125 2023-06-26 07:43:30,822 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1522722.0, ans=0.0 2023-06-26 07:43:32,463 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1522782.0, ans=0.2 2023-06-26 07:43:57,284 INFO [train.py:996] (0/4) Epoch 9, batch 9850, loss[loss=0.2181, simple_loss=0.2789, pruned_loss=0.0787, over 20184.00 frames. ], tot_loss[loss=0.217, simple_loss=0.293, pruned_loss=0.07049, over 4253880.53 frames. 
], batch size: 707, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:44:58,522 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.446e+02 4.827e+02 6.671e+02 1.006e+03 2.121e+03, threshold=1.334e+03, percent-clipped=9.0 2023-06-26 07:45:06,419 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1523022.0, ans=0.125 2023-06-26 07:45:30,148 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1523082.0, ans=0.2 2023-06-26 07:45:33,703 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1523082.0, ans=0.0 2023-06-26 07:45:35,051 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1523082.0, ans=0.0 2023-06-26 07:45:39,418 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.56 vs. limit=22.5 2023-06-26 07:45:52,817 INFO [train.py:996] (0/4) Epoch 9, batch 9900, loss[loss=0.1846, simple_loss=0.265, pruned_loss=0.05203, over 15365.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2913, pruned_loss=0.06943, over 4243213.57 frames. ], batch size: 60, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:46:44,410 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.28 vs. limit=15.0 2023-06-26 07:47:14,097 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1523322.0, ans=0.1 2023-06-26 07:47:35,626 INFO [train.py:996] (0/4) Epoch 9, batch 9950, loss[loss=0.2666, simple_loss=0.3066, pruned_loss=0.1133, over 21393.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2928, pruned_loss=0.07164, over 4237835.40 frames. ], batch size: 509, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:48:38,011 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.372e+02 4.966e+02 6.562e+02 9.646e+02 1.795e+03, threshold=1.312e+03, percent-clipped=7.0 2023-06-26 07:49:13,396 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.80 vs. limit=15.0 2023-06-26 07:49:31,806 INFO [train.py:996] (0/4) Epoch 9, batch 10000, loss[loss=0.1881, simple_loss=0.2619, pruned_loss=0.05711, over 21434.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2871, pruned_loss=0.07057, over 4248894.38 frames. ], batch size: 131, lr: 3.32e-03, grad_scale: 32.0 2023-06-26 07:49:39,964 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1523742.0, ans=0.125 2023-06-26 07:51:19,168 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1523982.0, ans=0.125 2023-06-26 07:51:22,422 INFO [train.py:996] (0/4) Epoch 9, batch 10050, loss[loss=0.1778, simple_loss=0.2513, pruned_loss=0.05214, over 21376.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.2895, pruned_loss=0.07086, over 4252024.35 frames. 
], batch size: 211, lr: 3.32e-03, grad_scale: 32.0 2023-06-26 07:51:48,040 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1524102.0, ans=0.125 2023-06-26 07:52:19,741 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1524162.0, ans=0.1 2023-06-26 07:52:31,218 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.556e+02 5.086e+02 7.732e+02 1.194e+03 2.294e+03, threshold=1.546e+03, percent-clipped=16.0 2023-06-26 07:52:58,434 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1524282.0, ans=0.125 2023-06-26 07:52:58,975 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.18 vs. limit=6.0 2023-06-26 07:53:02,243 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.28 vs. limit=15.0 2023-06-26 07:53:09,311 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.96 vs. limit=12.0 2023-06-26 07:53:13,028 INFO [train.py:996] (0/4) Epoch 9, batch 10100, loss[loss=0.2068, simple_loss=0.2942, pruned_loss=0.05976, over 21642.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2884, pruned_loss=0.06915, over 4256453.48 frames. ], batch size: 389, lr: 3.32e-03, grad_scale: 32.0 2023-06-26 07:53:31,354 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1524342.0, ans=0.0 2023-06-26 07:53:40,122 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.48 vs. limit=22.5 2023-06-26 07:54:14,275 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.62 vs. limit=10.0 2023-06-26 07:54:31,222 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1524522.0, ans=0.125 2023-06-26 07:54:45,121 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1524582.0, ans=0.125 2023-06-26 07:54:48,525 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1524582.0, ans=0.0 2023-06-26 07:55:07,058 INFO [train.py:996] (0/4) Epoch 9, batch 10150, loss[loss=0.2139, simple_loss=0.2991, pruned_loss=0.06434, over 21704.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2934, pruned_loss=0.07147, over 4264040.58 frames. ], batch size: 247, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:55:18,167 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1524642.0, ans=0.0 2023-06-26 07:56:10,909 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.445e+02 5.423e+02 7.380e+02 1.011e+03 1.635e+03, threshold=1.476e+03, percent-clipped=1.0 2023-06-26 07:56:14,170 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.39 vs. 
limit=6.0 2023-06-26 07:56:15,366 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1524822.0, ans=0.1 2023-06-26 07:56:28,210 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.38 vs. limit=15.0 2023-06-26 07:56:32,617 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1524882.0, ans=0.125 2023-06-26 07:56:56,537 INFO [train.py:996] (0/4) Epoch 9, batch 10200, loss[loss=0.1987, simple_loss=0.2876, pruned_loss=0.05493, over 21650.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2904, pruned_loss=0.06842, over 4259285.16 frames. ], batch size: 247, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:57:58,927 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1525122.0, ans=0.2 2023-06-26 07:57:59,537 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.56 vs. limit=22.5 2023-06-26 07:58:35,523 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1525182.0, ans=0.1 2023-06-26 07:58:47,142 INFO [train.py:996] (0/4) Epoch 9, batch 10250, loss[loss=0.2544, simple_loss=0.3302, pruned_loss=0.08934, over 21348.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2872, pruned_loss=0.064, over 4262635.86 frames. ], batch size: 507, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:59:53,960 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1525362.0, ans=0.0 2023-06-26 07:59:58,345 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.778e+02 4.201e+02 6.167e+02 1.103e+03 3.116e+03, threshold=1.233e+03, percent-clipped=15.0 2023-06-26 08:00:03,024 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1525422.0, ans=0.1 2023-06-26 08:00:38,950 INFO [train.py:996] (0/4) Epoch 9, batch 10300, loss[loss=0.2411, simple_loss=0.3097, pruned_loss=0.08625, over 21303.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2906, pruned_loss=0.06534, over 4258303.77 frames. ], batch size: 176, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 08:01:31,583 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1525662.0, ans=0.125 2023-06-26 08:02:30,562 INFO [train.py:996] (0/4) Epoch 9, batch 10350, loss[loss=0.2482, simple_loss=0.3248, pruned_loss=0.08579, over 21461.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2937, pruned_loss=0.06555, over 4269471.47 frames. ], batch size: 471, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 08:02:55,700 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=9.47 vs. limit=15.0 2023-06-26 08:03:46,309 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.376e+02 5.119e+02 7.830e+02 1.250e+03 2.539e+03, threshold=1.566e+03, percent-clipped=25.0 2023-06-26 08:03:57,014 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.20 vs. 
limit=15.0 2023-06-26 08:03:58,598 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.34 vs. limit=10.0 2023-06-26 08:04:30,903 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.77 vs. limit=22.5 2023-06-26 08:04:33,038 INFO [train.py:996] (0/4) Epoch 9, batch 10400, loss[loss=0.1854, simple_loss=0.2607, pruned_loss=0.05508, over 21712.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2882, pruned_loss=0.0652, over 4262546.76 frames. ], batch size: 298, lr: 3.32e-03, grad_scale: 32.0 2023-06-26 08:04:54,191 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.66 vs. limit=10.0 2023-06-26 08:05:28,950 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1526262.0, ans=0.125 2023-06-26 08:06:24,922 INFO [train.py:996] (0/4) Epoch 9, batch 10450, loss[loss=0.3142, simple_loss=0.3875, pruned_loss=0.1205, over 21379.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.292, pruned_loss=0.06829, over 4262500.54 frames. ], batch size: 507, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 08:07:29,715 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.606e+02 5.261e+02 7.908e+02 1.020e+03 2.061e+03, threshold=1.582e+03, percent-clipped=9.0 2023-06-26 08:08:13,037 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1526742.0, ans=0.125 2023-06-26 08:08:14,055 INFO [train.py:996] (0/4) Epoch 9, batch 10500, loss[loss=0.1759, simple_loss=0.2523, pruned_loss=0.04975, over 21664.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.291, pruned_loss=0.06701, over 4269137.54 frames. ], batch size: 247, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 08:08:51,154 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1526802.0, ans=0.125 2023-06-26 08:08:54,595 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1526802.0, ans=0.125 2023-06-26 08:09:10,132 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1526862.0, ans=0.0 2023-06-26 08:09:11,864 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1526862.0, ans=0.0 2023-06-26 08:09:27,969 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1526922.0, ans=0.125 2023-06-26 08:09:55,488 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.14 vs. limit=15.0 2023-06-26 08:09:59,834 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1526982.0, ans=0.125 2023-06-26 08:10:02,767 INFO [train.py:996] (0/4) Epoch 9, batch 10550, loss[loss=0.1883, simple_loss=0.2499, pruned_loss=0.06331, over 21868.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2859, pruned_loss=0.0667, over 4267345.19 frames. 
], batch size: 98, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 08:10:30,921 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1527102.0, ans=0.125 2023-06-26 08:10:31,678 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.10 vs. limit=10.0 2023-06-26 08:11:02,834 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1527162.0, ans=0.09899494936611666 2023-06-26 08:11:07,151 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.351e+02 4.011e+02 5.575e+02 6.702e+02 2.123e+03, threshold=1.115e+03, percent-clipped=3.0 2023-06-26 08:11:41,767 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1527282.0, ans=0.0 2023-06-26 08:11:47,874 INFO [train.py:996] (0/4) Epoch 9, batch 10600, loss[loss=0.1871, simple_loss=0.2796, pruned_loss=0.0473, over 21716.00 frames. ], tot_loss[loss=0.2063, simple_loss=0.2819, pruned_loss=0.06529, over 4276189.45 frames. ], batch size: 298, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 08:12:26,481 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1527402.0, ans=0.125 2023-06-26 08:13:44,630 INFO [train.py:996] (0/4) Epoch 9, batch 10650, loss[loss=0.1873, simple_loss=0.2729, pruned_loss=0.05086, over 21713.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2844, pruned_loss=0.06508, over 4253504.64 frames. ], batch size: 351, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:14:17,246 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.10 vs. limit=12.0 2023-06-26 08:14:49,636 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.095e+02 4.930e+02 8.313e+02 1.262e+03 3.074e+03, threshold=1.663e+03, percent-clipped=34.0 2023-06-26 08:14:57,388 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1527822.0, ans=0.0 2023-06-26 08:15:19,624 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.34 vs. limit=15.0 2023-06-26 08:15:22,494 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1527882.0, ans=0.125 2023-06-26 08:15:34,243 INFO [train.py:996] (0/4) Epoch 9, batch 10700, loss[loss=0.2636, simple_loss=0.3373, pruned_loss=0.09495, over 21764.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.2829, pruned_loss=0.06538, over 4259031.58 frames. ], batch size: 441, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:15:35,415 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=3.73 vs. limit=15.0
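The recurring optim.py:471 entries above summarize how gradients are being clipped: five grad-norm quartiles over recent updates, the resulting clipping threshold, and the share of updates that hit it. The reported numbers are consistent with the threshold being Clipping_scale times the median of the recent gradient norms (for the 08:14:49,636 entry, 2.0 x 8.313e+02 matches the reported threshold 1.663e+03). The sketch below only illustrates that bookkeeping; it is not the optim.py implementation used for this run, and the class name, window size and exact quartile computation are assumptions.

```python
from collections import deque

import torch


class QuartileGradClipper:
    """Illustrative sketch only: clip by a threshold derived from recent grad norms.

    Not the optim.py code behind this log; window size, names and the quartile
    bookkeeping are assumptions made for this example.
    """

    def __init__(self, params, clipping_scale=2.0, window=128):
        self.params = list(params)
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=window)    # recent total gradient norms
        self.clipped = deque(maxlen=window)  # whether each recent step was clipped

    def step(self):
        grads = [p.grad for p in self.params if p.grad is not None]
        norm = torch.norm(torch.stack([g.norm() for g in grads])).item()
        self.norms.append(norm)

        # Five-point summary (min, 25%, median, 75%, max) of recent norms,
        # matching the five "grad-norm quartiles" printed in the log.
        t = torch.tensor(sorted(self.norms))
        qs = [t[int(q * (len(t) - 1))].item() for q in (0.0, 0.25, 0.5, 0.75, 1.0)]
        threshold = self.clipping_scale * qs[2]  # e.g. 2.0 * median, as the log numbers suggest

        clip = norm > threshold
        self.clipped.append(clip)
        if clip:
            for g in grads:
                g.mul_(threshold / norm)

        percent_clipped = 100.0 * sum(self.clipped) / len(self.clipped)
        print(f"Clipping_scale={self.clipping_scale}, grad-norm quartiles "
              f"{' '.join(f'{q:.3e}' for q in qs)}, threshold={threshold:.3e}, "
              f"percent-clipped={percent_clipped:.1f}")
        return norm
```

2023-06-26 08:15:49,320 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.64 vs.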
limit=15.0 2023-06-26 08:15:54,026 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1528002.0, ans=0.125 2023-06-26 08:15:57,987 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1528002.0, ans=0.125 2023-06-26 08:16:04,738 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1528002.0, ans=0.125 2023-06-26 08:16:28,378 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1528062.0, ans=0.0 2023-06-26 08:16:48,889 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1528122.0, ans=0.125 2023-06-26 08:17:13,965 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1528182.0, ans=0.125 2023-06-26 08:17:20,283 INFO [train.py:996] (0/4) Epoch 9, batch 10750, loss[loss=0.2048, simple_loss=0.289, pruned_loss=0.06032, over 21421.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2946, pruned_loss=0.06969, over 4269438.96 frames. ], batch size: 194, lr: 3.31e-03, grad_scale: 8.0 2023-06-26 08:18:33,234 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.381e+02 4.303e+02 6.075e+02 7.797e+02 1.997e+03, threshold=1.215e+03, percent-clipped=3.0 2023-06-26 08:18:56,754 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1528482.0, ans=0.0 2023-06-26 08:19:10,521 INFO [train.py:996] (0/4) Epoch 9, batch 10800, loss[loss=0.2233, simple_loss=0.3016, pruned_loss=0.07249, over 21788.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2981, pruned_loss=0.07006, over 4266049.60 frames. ], batch size: 247, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:19:14,945 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1528542.0, ans=0.1 2023-06-26 08:19:20,223 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1528542.0, ans=0.04949747468305833 2023-06-26 08:19:52,624 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1528602.0, ans=0.0 2023-06-26 08:21:00,975 INFO [train.py:996] (0/4) Epoch 9, batch 10850, loss[loss=0.2234, simple_loss=0.2864, pruned_loss=0.08024, over 21548.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2987, pruned_loss=0.06973, over 4265925.96 frames. ], batch size: 391, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:21:17,767 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1528842.0, ans=0.125 2023-06-26 08:22:19,324 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.316e+02 4.810e+02 7.791e+02 1.214e+03 2.371e+03, threshold=1.558e+03, percent-clipped=23.0 2023-06-26 08:22:39,377 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.67 vs. 
limit=12.0 2023-06-26 08:22:39,377 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.whiten.whitening_limit, batch_count=1529082.0, ans=12.0 2023-06-26 08:22:40,619 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 08:22:56,756 INFO [train.py:996] (0/4) Epoch 9, batch 10900, loss[loss=0.2007, simple_loss=0.2865, pruned_loss=0.05747, over 21808.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2933, pruned_loss=0.06774, over 4267752.64 frames. ], batch size: 371, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:23:16,596 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1529142.0, ans=0.125 2023-06-26 08:23:49,796 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1529262.0, ans=0.125 2023-06-26 08:24:23,952 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1529382.0, ans=0.2 2023-06-26 08:24:44,118 INFO [train.py:996] (0/4) Epoch 9, batch 10950, loss[loss=0.1935, simple_loss=0.2606, pruned_loss=0.06326, over 21310.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2876, pruned_loss=0.06601, over 4259862.49 frames. ], batch size: 144, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:24:50,329 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1529442.0, ans=0.125 2023-06-26 08:24:59,018 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1529442.0, ans=0.125 2023-06-26 08:25:02,929 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.66 vs. limit=10.0 2023-06-26 08:25:05,697 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1529502.0, ans=0.2 2023-06-26 08:25:34,316 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.97 vs. limit=22.5 2023-06-26 08:25:37,073 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1529562.0, ans=0.0 2023-06-26 08:25:55,405 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.406e+02 4.859e+02 7.093e+02 1.092e+03 2.550e+03, threshold=1.419e+03, percent-clipped=10.0 2023-06-26 08:26:26,619 INFO [train.py:996] (0/4) Epoch 9, batch 11000, loss[loss=0.2217, simple_loss=0.2942, pruned_loss=0.07462, over 21471.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2863, pruned_loss=0.06687, over 4261571.48 frames. ], batch size: 131, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:27:48,120 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1529922.0, ans=0.0 2023-06-26 08:27:55,517 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.67 vs. limit=15.0 2023-06-26 08:28:02,621 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.64 vs. 
limit=22.5 2023-06-26 08:28:06,972 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1529982.0, ans=0.035 2023-06-26 08:28:20,362 INFO [train.py:996] (0/4) Epoch 9, batch 11050, loss[loss=0.1771, simple_loss=0.2437, pruned_loss=0.05521, over 21272.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2835, pruned_loss=0.06787, over 4270588.63 frames. ], batch size: 176, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:28:22,781 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1530042.0, ans=0.2 2023-06-26 08:28:28,203 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 08:28:28,312 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1530042.0, ans=0.125 2023-06-26 08:28:44,502 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1530102.0, ans=0.0 2023-06-26 08:28:46,021 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 08:29:26,015 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1530162.0, ans=0.125 2023-06-26 08:29:32,134 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.162e+02 4.865e+02 7.286e+02 1.085e+03 1.953e+03, threshold=1.457e+03, percent-clipped=8.0 2023-06-26 08:30:03,326 INFO [train.py:996] (0/4) Epoch 9, batch 11100, loss[loss=0.2166, simple_loss=0.2925, pruned_loss=0.07038, over 21362.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.2817, pruned_loss=0.06746, over 4268047.35 frames. ], batch size: 211, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:30:49,708 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1530402.0, ans=0.0 2023-06-26 08:31:00,431 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1530462.0, ans=0.0 2023-06-26 08:31:28,105 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.96 vs. limit=6.0 2023-06-26 08:31:50,859 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1530642.0, ans=0.0 2023-06-26 08:31:57,817 INFO [train.py:996] (0/4) Epoch 9, batch 11150, loss[loss=0.2137, simple_loss=0.2755, pruned_loss=0.07593, over 21531.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2802, pruned_loss=0.06744, over 4268952.81 frames. ], batch size: 441, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:33:07,532 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.41 vs. limit=22.5 2023-06-26 08:33:09,610 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.403e+02 4.594e+02 7.408e+02 1.103e+03 2.164e+03, threshold=1.482e+03, percent-clipped=12.0 2023-06-26 08:33:27,069 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1530882.0, ans=0.125 2023-06-26 08:33:40,339 INFO [train.py:996] (0/4) Epoch 9, batch 11200, loss[loss=0.1791, simple_loss=0.2482, pruned_loss=0.05502, over 21473.00 frames. 
], tot_loss[loss=0.2067, simple_loss=0.2804, pruned_loss=0.06651, over 4273865.89 frames. ], batch size: 230, lr: 3.31e-03, grad_scale: 32.0 2023-06-26 08:33:51,544 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1530942.0, ans=0.125 2023-06-26 08:33:55,502 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.44 vs. limit=15.0 2023-06-26 08:34:11,013 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1531002.0, ans=0.125 2023-06-26 08:34:39,518 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1531062.0, ans=0.0 2023-06-26 08:34:57,647 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.82 vs. limit=15.0 2023-06-26 08:35:30,907 INFO [train.py:996] (0/4) Epoch 9, batch 11250, loss[loss=0.2497, simple_loss=0.305, pruned_loss=0.09713, over 21659.00 frames. ], tot_loss[loss=0.2058, simple_loss=0.2793, pruned_loss=0.06619, over 4273724.78 frames. ], batch size: 508, lr: 3.31e-03, grad_scale: 32.0 2023-06-26 08:35:44,372 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1531242.0, ans=0.0 2023-06-26 08:35:46,795 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=13.22 vs. limit=15.0 2023-06-26 08:36:05,575 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=1531302.0, ans=0.5 2023-06-26 08:36:26,509 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1531362.0, ans=0.07 2023-06-26 08:36:48,773 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.33 vs. limit=12.0 2023-06-26 08:36:50,038 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.48 vs. limit=22.5 2023-06-26 08:36:50,827 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.393e+02 4.914e+02 6.866e+02 9.264e+02 1.730e+03, threshold=1.373e+03, percent-clipped=7.0 2023-06-26 08:36:51,488 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1531422.0, ans=0.125 2023-06-26 08:37:20,667 INFO [train.py:996] (0/4) Epoch 9, batch 11300, loss[loss=0.1901, simple_loss=0.2754, pruned_loss=0.05242, over 21811.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.2806, pruned_loss=0.06655, over 4284818.16 frames. 
], batch size: 351, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:37:36,104 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1531542.0, ans=0.125 2023-06-26 08:37:50,095 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 08:38:24,926 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=1531662.0, ans=0.05 2023-06-26 08:38:42,915 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1531722.0, ans=0.125 2023-06-26 08:39:16,335 INFO [train.py:996] (0/4) Epoch 9, batch 11350, loss[loss=0.1998, simple_loss=0.2813, pruned_loss=0.05912, over 21172.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2833, pruned_loss=0.06654, over 4280977.18 frames. ], batch size: 143, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:40:31,491 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.547e+02 4.947e+02 6.813e+02 1.038e+03 3.040e+03, threshold=1.363e+03, percent-clipped=13.0 2023-06-26 08:41:05,591 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1532082.0, ans=0.125 2023-06-26 08:41:08,364 INFO [train.py:996] (0/4) Epoch 9, batch 11400, loss[loss=0.2221, simple_loss=0.3034, pruned_loss=0.0704, over 21448.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2886, pruned_loss=0.06837, over 4279112.52 frames. ], batch size: 211, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:41:49,092 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.61 vs. limit=10.0 2023-06-26 08:42:04,285 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1532262.0, ans=0.2 2023-06-26 08:42:40,009 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1532382.0, ans=0.125 2023-06-26 08:42:43,971 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1532382.0, ans=0.0 2023-06-26 08:42:50,772 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1532382.0, ans=0.1
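The scaling.py:182 lines above report ScheduledFloat values: hyperparameters such as dropout probabilities, balancer limits and skip rates whose current value ("ans") is looked up from a schedule keyed on the global batch_count. The snippet below is a minimal, hypothetical sketch of such a piecewise-linear schedule, not the ScheduledFloat class from this codebase; the class name and the breakpoints are invented for illustration.

```python
from bisect import bisect_right


class PiecewiseLinearSchedule:
    """Hypothetical sketch of a float hyperparameter scheduled on batch_count.

    Values are linearly interpolated between (batch_count, value) breakpoints
    and held constant outside that range. This only illustrates the idea behind
    the "ScheduledFloat: ... batch_count=..., ans=..." log lines; it is not the
    actual implementation used for this run.
    """

    def __init__(self, *points):
        # points: e.g. (0, 0.3), (20000, 0.1) -> decay from 0.3 to 0.1 over 20k batches
        self.xs = [float(x) for x, _ in points]
        self.ys = [float(y) for _, y in points]

    def value(self, batch_count: float) -> float:
        if batch_count <= self.xs[0]:
            return self.ys[0]
        if batch_count >= self.xs[-1]:
            return self.ys[-1]
        i = bisect_right(self.xs, batch_count)
        x0, x1 = self.xs[i - 1], self.xs[i]
        y0, y1 = self.ys[i - 1], self.ys[i]
        t = (batch_count - x0) / (x1 - x0)
        return y0 + t * (y1 - y0)


# Example with made-up breakpoints: a dropout probability decaying from 0.3 to 0.1.
dropout_p = PiecewiseLinearSchedule((0, 0.3), (20000, 0.1))
print(dropout_p.value(1532382.0))  # far past the last breakpoint -> 0.1
```

2023-06-26 08:43:04,954 INFO [train.py:996] (0/4) Epoch 9, batch 11450, loss[loss=0.2341, simple_loss=0.3115, pruned_loss=0.07837, over 21705.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2899, pruned_loss=0.06749, over 4281499.90 frames.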
], batch size: 351, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:43:23,873 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1532442.0, ans=0.0 2023-06-26 08:43:30,764 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=1532502.0, ans=6.0 2023-06-26 08:43:57,845 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1532562.0, ans=0.125 2023-06-26 08:44:04,135 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1532562.0, ans=0.125 2023-06-26 08:44:14,734 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.477e+02 5.112e+02 7.054e+02 1.112e+03 2.275e+03, threshold=1.411e+03, percent-clipped=15.0 2023-06-26 08:45:01,354 INFO [train.py:996] (0/4) Epoch 9, batch 11500, loss[loss=0.232, simple_loss=0.3232, pruned_loss=0.07035, over 21630.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2937, pruned_loss=0.06854, over 4285761.85 frames. ], batch size: 414, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:45:07,351 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1532742.0, ans=0.0 2023-06-26 08:45:40,469 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1532862.0, ans=0.2 2023-06-26 08:45:42,351 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1532862.0, ans=0.0 2023-06-26 08:45:46,268 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1532862.0, ans=0.125 2023-06-26 08:46:04,526 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1532922.0, ans=0.125 2023-06-26 08:46:30,109 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1532982.0, ans=0.125 2023-06-26 08:46:53,181 INFO [train.py:996] (0/4) Epoch 9, batch 11550, loss[loss=0.2528, simple_loss=0.3448, pruned_loss=0.08037, over 21628.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2999, pruned_loss=0.06878, over 4286356.25 frames. ], batch size: 263, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:47:29,039 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.67 vs. limit=15.0 2023-06-26 08:48:07,906 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.05 vs. 
limit=6.0 2023-06-26 08:48:08,428 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.509e+02 5.865e+02 8.299e+02 1.163e+03 3.420e+03, threshold=1.660e+03, percent-clipped=18.0 2023-06-26 08:48:23,078 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1533222.0, ans=0.025 2023-06-26 08:48:41,235 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1533282.0, ans=0.2 2023-06-26 08:48:48,932 INFO [train.py:996] (0/4) Epoch 9, batch 11600, loss[loss=0.2323, simple_loss=0.3338, pruned_loss=0.06533, over 21693.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.3114, pruned_loss=0.07034, over 4281701.65 frames. ], batch size: 263, lr: 3.31e-03, grad_scale: 32.0 2023-06-26 08:48:54,221 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.16 vs. limit=12.0 2023-06-26 08:49:27,586 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1533402.0, ans=10.0 2023-06-26 08:49:32,625 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1533462.0, ans=0.125 2023-06-26 08:49:34,400 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1533462.0, ans=0.025 2023-06-26 08:50:15,851 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 08:50:15,948 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1533582.0, ans=0.125 2023-06-26 08:50:38,005 INFO [train.py:996] (0/4) Epoch 9, batch 11650, loss[loss=0.2187, simple_loss=0.3114, pruned_loss=0.06299, over 21426.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.319, pruned_loss=0.07178, over 4275968.78 frames. ], batch size: 211, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:51:19,742 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 08:51:52,917 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.214e+02 7.495e+02 1.149e+03 1.864e+03 4.386e+03, threshold=2.298e+03, percent-clipped=28.0 2023-06-26 08:52:13,099 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.83 vs. limit=15.0 2023-06-26 08:52:26,006 INFO [train.py:996] (0/4) Epoch 9, batch 11700, loss[loss=0.1892, simple_loss=0.2606, pruned_loss=0.05885, over 21581.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.3093, pruned_loss=0.07048, over 4277205.95 frames. ], batch size: 263, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:52:26,582 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1533942.0, ans=0.125 2023-06-26 08:54:02,304 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1534182.0, ans=0.2 2023-06-26 08:54:13,619 INFO [train.py:996] (0/4) Epoch 9, batch 11750, loss[loss=0.2307, simple_loss=0.3031, pruned_loss=0.07917, over 21269.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.3004, pruned_loss=0.07025, over 4272862.86 frames. 
], batch size: 176, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:54:45,522 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.59 vs. limit=12.0 2023-06-26 08:54:49,174 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.78 vs. limit=15.0 2023-06-26 08:55:31,060 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.373e+02 4.368e+02 6.221e+02 1.023e+03 2.709e+03, threshold=1.244e+03, percent-clipped=2.0 2023-06-26 08:55:34,610 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.12 vs. limit=15.0 2023-06-26 08:55:49,450 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1534482.0, ans=0.04949747468305833 2023-06-26 08:56:03,967 INFO [train.py:996] (0/4) Epoch 9, batch 11800, loss[loss=0.2159, simple_loss=0.3132, pruned_loss=0.05925, over 20736.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.3014, pruned_loss=0.07188, over 4271858.80 frames. ], batch size: 607, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:56:08,504 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1534542.0, ans=0.05 2023-06-26 08:56:18,823 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1534542.0, ans=0.125 2023-06-26 08:57:19,246 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1534722.0, ans=0.125 2023-06-26 08:57:52,746 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1534842.0, ans=0.0 2023-06-26 08:57:53,769 INFO [train.py:996] (0/4) Epoch 9, batch 11850, loss[loss=0.2235, simple_loss=0.3148, pruned_loss=0.06613, over 21437.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.3029, pruned_loss=0.07181, over 4276078.29 frames. ], batch size: 548, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:58:11,279 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1534842.0, ans=0.0 2023-06-26 08:58:52,718 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1534962.0, ans=0.0 2023-06-26 08:58:57,773 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 08:59:16,062 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.272e+02 4.313e+02 5.764e+02 8.343e+02 1.784e+03, threshold=1.153e+03, percent-clipped=5.0 2023-06-26 08:59:50,230 INFO [train.py:996] (0/4) Epoch 9, batch 11900, loss[loss=0.2532, simple_loss=0.3282, pruned_loss=0.08913, over 21378.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.3026, pruned_loss=0.06908, over 4276211.52 frames. 
], batch size: 471, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 09:00:25,559 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1535202.0, ans=0.0 2023-06-26 09:00:27,266 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1535202.0, ans=0.2 2023-06-26 09:00:45,065 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1535262.0, ans=0.0 2023-06-26 09:00:48,902 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 09:01:13,104 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.24 vs. limit=15.0 2023-06-26 09:01:15,559 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1535382.0, ans=0.125 2023-06-26 09:01:36,242 INFO [train.py:996] (0/4) Epoch 9, batch 11950, loss[loss=0.1708, simple_loss=0.2691, pruned_loss=0.0363, over 21722.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.3037, pruned_loss=0.06695, over 4272078.78 frames. ], batch size: 351, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 09:01:41,057 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.12 vs. limit=6.0 2023-06-26 09:02:35,598 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.98 vs. limit=12.0 2023-06-26 09:02:50,741 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.248e+02 4.636e+02 6.640e+02 1.069e+03 2.597e+03, threshold=1.328e+03, percent-clipped=19.0 2023-06-26 09:03:23,562 INFO [train.py:996] (0/4) Epoch 9, batch 12000, loss[loss=0.1831, simple_loss=0.25, pruned_loss=0.05816, over 21198.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2981, pruned_loss=0.06516, over 4275856.75 frames. ], batch size: 548, lr: 3.31e-03, grad_scale: 32.0 2023-06-26 09:03:23,563 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-26 09:03:41,736 INFO [train.py:1028] (0/4) Epoch 9, validation: loss=0.2638, simple_loss=0.3517, pruned_loss=0.08798, over 1796401.00 frames. 2023-06-26 09:03:41,738 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23714MB 2023-06-26 09:04:00,833 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1535742.0, ans=0.2 2023-06-26 09:04:10,101 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1535802.0, ans=0.125 2023-06-26 09:04:11,615 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1535802.0, ans=0.05 2023-06-26 09:04:30,535 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1535802.0, ans=0.1 2023-06-26 09:04:34,334 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.21 vs. limit=15.0 2023-06-26 09:04:49,618 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.42 vs. 
limit=22.5 2023-06-26 09:05:16,670 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-256000.pt 2023-06-26 09:05:25,918 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.21 vs. limit=15.0 2023-06-26 09:05:31,717 INFO [train.py:996] (0/4) Epoch 9, batch 12050, loss[loss=0.2135, simple_loss=0.2849, pruned_loss=0.07106, over 21290.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2938, pruned_loss=0.06688, over 4283203.75 frames. ], batch size: 159, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 09:06:27,050 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.94 vs. limit=22.5 2023-06-26 09:06:54,783 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.278e+02 4.979e+02 7.743e+02 1.300e+03 2.733e+03, threshold=1.549e+03, percent-clipped=23.0 2023-06-26 09:07:34,240 INFO [train.py:996] (0/4) Epoch 9, batch 12100, loss[loss=0.2211, simple_loss=0.3031, pruned_loss=0.06951, over 21707.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.299, pruned_loss=0.07077, over 4283498.05 frames. ], batch size: 298, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 09:07:56,178 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.16 vs. limit=6.0 2023-06-26 09:09:27,810 INFO [train.py:996] (0/4) Epoch 9, batch 12150, loss[loss=0.1867, simple_loss=0.2371, pruned_loss=0.06812, over 20855.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.3036, pruned_loss=0.07051, over 4268461.28 frames. ], batch size: 611, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 09:09:48,973 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1536702.0, ans=0.125 2023-06-26 09:10:12,155 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1536762.0, ans=0.2 2023-06-26 09:10:43,936 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.377e+02 5.272e+02 8.352e+02 1.536e+03 2.585e+03, threshold=1.670e+03, percent-clipped=24.0 2023-06-26 09:11:06,119 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1536882.0, ans=0.07 2023-06-26 09:11:19,131 INFO [train.py:996] (0/4) Epoch 9, batch 12200, loss[loss=0.2153, simple_loss=0.2741, pruned_loss=0.07824, over 21177.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2984, pruned_loss=0.06975, over 4271893.03 frames. ], batch size: 160, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:12:03,424 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1537062.0, ans=0.1 2023-06-26 09:12:17,313 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1537122.0, ans=0.05 2023-06-26 09:12:28,182 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1537122.0, ans=0.09899494936611666 2023-06-26 09:12:57,373 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1537182.0, ans=0.125 2023-06-26 09:13:06,915 INFO [train.py:996] (0/4) Epoch 9, batch 12250, loss[loss=0.1757, simple_loss=0.251, pruned_loss=0.05019, over 21275.00 frames. 
], tot_loss[loss=0.2118, simple_loss=0.2901, pruned_loss=0.06675, over 4275168.91 frames. ], batch size: 176, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:13:09,210 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1537242.0, ans=0.125 2023-06-26 09:13:17,765 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1537242.0, ans=0.1 2023-06-26 09:13:17,823 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1537242.0, ans=0.125 2023-06-26 09:13:32,461 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.98 vs. limit=22.5 2023-06-26 09:14:12,895 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.985e+02 4.210e+02 5.762e+02 8.754e+02 2.023e+03, threshold=1.152e+03, percent-clipped=2.0 2023-06-26 09:14:36,306 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 09:14:55,454 INFO [train.py:996] (0/4) Epoch 9, batch 12300, loss[loss=0.1551, simple_loss=0.2369, pruned_loss=0.03663, over 21288.00 frames. ], tot_loss[loss=0.2031, simple_loss=0.2832, pruned_loss=0.0615, over 4281856.41 frames. ], batch size: 176, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:15:03,221 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1537542.0, ans=0.125 2023-06-26 09:15:08,644 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1537542.0, ans=0.125 2023-06-26 09:15:41,635 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1537662.0, ans=0.0 2023-06-26 09:16:42,689 INFO [train.py:996] (0/4) Epoch 9, batch 12350, loss[loss=0.2183, simple_loss=0.3026, pruned_loss=0.06696, over 21901.00 frames. ], tot_loss[loss=0.206, simple_loss=0.2883, pruned_loss=0.06189, over 4278414.22 frames. ], batch size: 316, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:17:09,502 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.28 vs. limit=15.0 2023-06-26 09:17:47,949 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.420e+02 5.624e+02 9.354e+02 1.463e+03 3.322e+03, threshold=1.871e+03, percent-clipped=32.0 2023-06-26 09:18:13,474 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1538082.0, ans=0.0 2023-06-26 09:18:28,172 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1538142.0, ans=0.125 2023-06-26 09:18:29,217 INFO [train.py:996] (0/4) Epoch 9, batch 12400, loss[loss=0.207, simple_loss=0.2732, pruned_loss=0.07036, over 21667.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2895, pruned_loss=0.06465, over 4282214.82 frames. ], batch size: 230, lr: 3.30e-03, grad_scale: 32.0 2023-06-26 09:20:13,928 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1538382.0, ans=0.125 2023-06-26 09:20:18,879 INFO [train.py:996] (0/4) Epoch 9, batch 12450, loss[loss=0.2278, simple_loss=0.3038, pruned_loss=0.07587, over 21815.00 frames. 
], tot_loss[loss=0.2145, simple_loss=0.2935, pruned_loss=0.06778, over 4287450.93 frames. ], batch size: 282, lr: 3.30e-03, grad_scale: 32.0 2023-06-26 09:21:43,232 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.605e+02 6.014e+02 7.920e+02 1.251e+03 2.737e+03, threshold=1.584e+03, percent-clipped=3.0 2023-06-26 09:22:16,000 INFO [train.py:996] (0/4) Epoch 9, batch 12500, loss[loss=0.2554, simple_loss=0.3475, pruned_loss=0.08165, over 21775.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.3033, pruned_loss=0.07102, over 4282634.27 frames. ], batch size: 124, lr: 3.30e-03, grad_scale: 32.0 2023-06-26 09:22:23,420 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1538742.0, ans=0.125 2023-06-26 09:23:39,545 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1538922.0, ans=0.1 2023-06-26 09:24:07,307 INFO [train.py:996] (0/4) Epoch 9, batch 12550, loss[loss=0.231, simple_loss=0.3139, pruned_loss=0.07402, over 21664.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.3077, pruned_loss=0.07303, over 4279337.57 frames. ], batch size: 298, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:24:08,062 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1539042.0, ans=0.125 2023-06-26 09:24:33,189 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.84 vs. limit=15.0 2023-06-26 09:25:03,078 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1539162.0, ans=0.125 2023-06-26 09:25:32,849 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.614e+02 5.506e+02 7.478e+02 1.164e+03 2.448e+03, threshold=1.496e+03, percent-clipped=9.0 2023-06-26 09:25:36,896 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1539222.0, ans=0.0 2023-06-26 09:26:02,766 INFO [train.py:996] (0/4) Epoch 9, batch 12600, loss[loss=0.2593, simple_loss=0.3412, pruned_loss=0.08874, over 21419.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.3073, pruned_loss=0.0713, over 4282761.63 frames. ], batch size: 507, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:26:11,904 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.27 vs. limit=6.0 2023-06-26 09:27:01,500 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1539462.0, ans=0.0 2023-06-26 09:27:33,249 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1539582.0, ans=0.0 2023-06-26 09:27:50,964 INFO [train.py:996] (0/4) Epoch 9, batch 12650, loss[loss=0.2161, simple_loss=0.2906, pruned_loss=0.07079, over 21874.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.3006, pruned_loss=0.06796, over 4279038.34 frames. ], batch size: 124, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:29:03,550 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.98 vs. 
limit=22.5 2023-06-26 09:29:09,317 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.312e+02 4.812e+02 9.064e+02 1.405e+03 2.946e+03, threshold=1.813e+03, percent-clipped=21.0 2023-06-26 09:29:10,018 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1539822.0, ans=0.0 2023-06-26 09:29:41,858 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1539882.0, ans=0.2 2023-06-26 09:29:43,820 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1539942.0, ans=0.125 2023-06-26 09:29:44,748 INFO [train.py:996] (0/4) Epoch 9, batch 12700, loss[loss=0.2534, simple_loss=0.3236, pruned_loss=0.09162, over 21807.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2993, pruned_loss=0.06983, over 4285312.05 frames. ], batch size: 441, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:29:52,425 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1539942.0, ans=0.1 2023-06-26 09:30:04,265 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1539942.0, ans=0.125 2023-06-26 09:30:13,408 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1540002.0, ans=0.125 2023-06-26 09:30:18,675 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1540002.0, ans=0.125 2023-06-26 09:30:50,149 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1540122.0, ans=0.2 2023-06-26 09:31:00,411 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1540122.0, ans=0.125 2023-06-26 09:31:20,921 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1540182.0, ans=0.0 2023-06-26 09:31:32,355 INFO [train.py:996] (0/4) Epoch 9, batch 12750, loss[loss=0.2264, simple_loss=0.2953, pruned_loss=0.07874, over 19936.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.3, pruned_loss=0.06992, over 4284048.46 frames. ], batch size: 702, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:31:36,584 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1540242.0, ans=0.0 2023-06-26 09:32:08,255 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.55 vs. limit=12.0 2023-06-26 09:32:45,550 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.578e+02 5.161e+02 7.205e+02 9.772e+02 1.736e+03, threshold=1.441e+03, percent-clipped=0.0 2023-06-26 09:33:00,898 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1540482.0, ans=0.0 2023-06-26 09:33:16,982 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1540482.0, ans=0.0 2023-06-26 09:33:19,535 INFO [train.py:996] (0/4) Epoch 9, batch 12800, loss[loss=0.2115, simple_loss=0.2847, pruned_loss=0.06913, over 21857.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2988, pruned_loss=0.07001, over 4283154.24 frames. 
], batch size: 247, lr: 3.30e-03, grad_scale: 32.0 2023-06-26 09:33:38,872 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1540542.0, ans=0.125 2023-06-26 09:35:07,475 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1540782.0, ans=0.125 2023-06-26 09:35:10,706 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1540782.0, ans=0.0 2023-06-26 09:35:13,893 INFO [train.py:996] (0/4) Epoch 9, batch 12850, loss[loss=0.1867, simple_loss=0.2885, pruned_loss=0.04238, over 21831.00 frames. ], tot_loss[loss=0.222, simple_loss=0.3011, pruned_loss=0.0714, over 4286689.42 frames. ], batch size: 371, lr: 3.30e-03, grad_scale: 32.0 2023-06-26 09:35:30,992 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1540902.0, ans=0.125 2023-06-26 09:35:54,393 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1540962.0, ans=0.0 2023-06-26 09:36:05,606 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1540962.0, ans=0.0 2023-06-26 09:36:12,468 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1541022.0, ans=0.1 2023-06-26 09:36:26,757 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1541022.0, ans=0.125 2023-06-26 09:36:33,396 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1541022.0, ans=0.0 2023-06-26 09:36:36,355 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.020e+02 4.565e+02 5.945e+02 7.206e+02 1.665e+03, threshold=1.189e+03, percent-clipped=1.0 2023-06-26 09:36:40,997 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1541082.0, ans=0.125 2023-06-26 09:37:04,599 INFO [train.py:996] (0/4) Epoch 9, batch 12900, loss[loss=0.1743, simple_loss=0.2523, pruned_loss=0.04818, over 21787.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2986, pruned_loss=0.06798, over 4287544.12 frames. ], batch size: 118, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:37:05,458 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1541142.0, ans=0.125 2023-06-26 09:38:16,948 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1541322.0, ans=0.125 2023-06-26 09:38:44,739 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1541382.0, ans=10.0 2023-06-26 09:38:55,134 INFO [train.py:996] (0/4) Epoch 9, batch 12950, loss[loss=0.1549, simple_loss=0.224, pruned_loss=0.04288, over 17043.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2976, pruned_loss=0.06665, over 4283103.07 frames. 
], batch size: 63, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:39:06,460 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1541442.0, ans=0.125 2023-06-26 09:40:20,607 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.46 vs. limit=15.0 2023-06-26 09:40:21,282 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.550e+02 5.457e+02 7.611e+02 1.240e+03 2.264e+03, threshold=1.522e+03, percent-clipped=25.0 2023-06-26 09:40:26,662 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1541682.0, ans=0.2 2023-06-26 09:40:35,651 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1541682.0, ans=0.125 2023-06-26 09:40:43,304 INFO [train.py:996] (0/4) Epoch 9, batch 13000, loss[loss=0.1742, simple_loss=0.2585, pruned_loss=0.04499, over 21617.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2974, pruned_loss=0.06604, over 4282664.46 frames. ], batch size: 263, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:41:45,533 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1541862.0, ans=0.2 2023-06-26 09:42:19,217 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1541982.0, ans=0.1 2023-06-26 09:42:21,151 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.25 vs. limit=15.0 2023-06-26 09:42:31,836 INFO [train.py:996] (0/4) Epoch 9, batch 13050, loss[loss=0.192, simple_loss=0.265, pruned_loss=0.05952, over 21705.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2955, pruned_loss=0.06467, over 4269927.22 frames. ], batch size: 230, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:43:49,541 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.39 vs. limit=6.0 2023-06-26 09:43:58,312 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.428e+02 4.464e+02 7.205e+02 1.000e+03 2.248e+03, threshold=1.441e+03, percent-clipped=5.0 2023-06-26 09:44:15,031 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1542282.0, ans=0.125 2023-06-26 09:44:21,915 INFO [train.py:996] (0/4) Epoch 9, batch 13100, loss[loss=0.1995, simple_loss=0.2942, pruned_loss=0.05239, over 21783.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2966, pruned_loss=0.06484, over 4268652.12 frames. 
], batch size: 332, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:44:24,410 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1542342.0, ans=0.125 2023-06-26 09:45:29,741 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1542462.0, ans=0.0 2023-06-26 09:45:53,630 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1542522.0, ans=0.125 2023-06-26 09:46:19,137 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1542642.0, ans=0.125 2023-06-26 09:46:20,422 INFO [train.py:996] (0/4) Epoch 9, batch 13150, loss[loss=0.2022, simple_loss=0.2829, pruned_loss=0.06078, over 21718.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2996, pruned_loss=0.0682, over 4273523.29 frames. ], batch size: 332, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:47:36,143 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.33 vs. limit=15.0 2023-06-26 09:47:41,097 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1542822.0, ans=0.0 2023-06-26 09:47:43,985 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.265e+02 6.125e+02 9.524e+02 1.520e+03 3.301e+03, threshold=1.905e+03, percent-clipped=27.0 2023-06-26 09:47:44,483 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1542822.0, ans=0.0 2023-06-26 09:47:46,081 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1542822.0, ans=0.125 2023-06-26 09:48:03,696 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1542882.0, ans=0.125 2023-06-26 09:48:23,365 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1542942.0, ans=0.0 2023-06-26 09:48:24,354 INFO [train.py:996] (0/4) Epoch 9, batch 13200, loss[loss=0.2449, simple_loss=0.3209, pruned_loss=0.08441, over 21582.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2967, pruned_loss=0.06866, over 4273766.12 frames. ], batch size: 415, lr: 3.30e-03, grad_scale: 32.0 2023-06-26 09:48:28,871 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1542942.0, ans=0.125 2023-06-26 09:48:35,772 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1542942.0, ans=0.0 2023-06-26 09:49:22,132 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1543122.0, ans=0.125 2023-06-26 09:50:16,142 INFO [train.py:996] (0/4) Epoch 9, batch 13250, loss[loss=0.2097, simple_loss=0.2979, pruned_loss=0.06075, over 20662.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2973, pruned_loss=0.06963, over 4273395.28 frames. 
], batch size: 607, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:50:16,810 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1543242.0, ans=0.2 2023-06-26 09:51:13,695 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1543422.0, ans=0.0 2023-06-26 09:51:48,022 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.553e+02 4.712e+02 6.598e+02 9.234e+02 1.581e+03, threshold=1.320e+03, percent-clipped=0.0 2023-06-26 09:51:57,689 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1543482.0, ans=0.04949747468305833 2023-06-26 09:52:13,053 INFO [train.py:996] (0/4) Epoch 9, batch 13300, loss[loss=0.2497, simple_loss=0.3407, pruned_loss=0.07936, over 21621.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.3009, pruned_loss=0.06949, over 4274193.64 frames. ], batch size: 414, lr: 3.30e-03, grad_scale: 8.0 2023-06-26 09:52:28,444 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1543542.0, ans=0.125 2023-06-26 09:53:07,052 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1543662.0, ans=0.0 2023-06-26 09:53:26,838 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1543722.0, ans=0.0 2023-06-26 09:54:02,427 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn2.whiten.whitening_limit, batch_count=1543842.0, ans=22.5 2023-06-26 09:54:02,890 INFO [train.py:996] (0/4) Epoch 9, batch 13350, loss[loss=0.2497, simple_loss=0.3273, pruned_loss=0.086, over 21718.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.3049, pruned_loss=0.07166, over 4274662.30 frames. ], batch size: 247, lr: 3.30e-03, grad_scale: 8.0 2023-06-26 09:54:33,062 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1543902.0, ans=0.2 2023-06-26 09:54:40,139 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1543962.0, ans=0.0 2023-06-26 09:55:17,366 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1544022.0, ans=0.1 2023-06-26 09:55:27,615 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.028e+02 5.402e+02 7.933e+02 1.042e+03 2.169e+03, threshold=1.587e+03, percent-clipped=13.0 2023-06-26 09:55:31,735 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1544082.0, ans=0.2 2023-06-26 09:55:40,524 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1544082.0, ans=0.2 2023-06-26 09:55:49,717 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.80 vs. limit=12.0 2023-06-26 09:55:51,788 INFO [train.py:996] (0/4) Epoch 9, batch 13400, loss[loss=0.2481, simple_loss=0.3191, pruned_loss=0.08857, over 21712.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.3055, pruned_loss=0.07195, over 4277171.03 frames. 
], batch size: 414, lr: 3.30e-03, grad_scale: 8.0 2023-06-26 09:56:05,523 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1544142.0, ans=0.125 2023-06-26 09:56:16,788 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.57 vs. limit=15.0 2023-06-26 09:56:31,226 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.37 vs. limit=22.5 2023-06-26 09:56:56,570 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1544322.0, ans=0.0 2023-06-26 09:56:58,492 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1544322.0, ans=0.0 2023-06-26 09:56:59,172 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.65 vs. limit=15.0 2023-06-26 09:57:00,203 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1544322.0, ans=0.125 2023-06-26 09:57:28,954 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1544382.0, ans=0.125 2023-06-26 09:57:39,192 INFO [train.py:996] (0/4) Epoch 9, batch 13450, loss[loss=0.2139, simple_loss=0.2889, pruned_loss=0.06947, over 21645.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3068, pruned_loss=0.07465, over 4270070.34 frames. ], batch size: 391, lr: 3.30e-03, grad_scale: 8.0 2023-06-26 09:57:45,061 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1544442.0, ans=0.125 2023-06-26 09:58:44,502 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1544562.0, ans=0.125 2023-06-26 09:59:09,293 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1544622.0, ans=0.125 2023-06-26 09:59:10,548 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.398e+02 5.052e+02 6.156e+02 8.765e+02 1.835e+03, threshold=1.231e+03, percent-clipped=4.0 2023-06-26 09:59:30,352 INFO [train.py:996] (0/4) Epoch 9, batch 13500, loss[loss=0.2171, simple_loss=0.2933, pruned_loss=0.07044, over 21872.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2986, pruned_loss=0.07252, over 4273402.10 frames. ], batch size: 317, lr: 3.30e-03, grad_scale: 8.0 2023-06-26 10:00:12,195 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1544802.0, ans=0.125 2023-06-26 10:00:34,939 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.60 vs. 
limit=22.5 2023-06-26 10:00:42,867 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1544862.0, ans=0.125 2023-06-26 10:00:46,803 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1544922.0, ans=0.0 2023-06-26 10:00:52,148 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1544922.0, ans=0.125 2023-06-26 10:00:52,813 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.50 vs. limit=15.0 2023-06-26 10:01:03,070 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1544982.0, ans=0.0 2023-06-26 10:01:27,166 INFO [train.py:996] (0/4) Epoch 9, batch 13550, loss[loss=0.2269, simple_loss=0.3352, pruned_loss=0.05931, over 21708.00 frames. ], tot_loss[loss=0.223, simple_loss=0.302, pruned_loss=0.072, over 4274042.57 frames. ], batch size: 298, lr: 3.30e-03, grad_scale: 8.0 2023-06-26 10:02:03,424 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1545102.0, ans=0.125 2023-06-26 10:02:07,043 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 10:02:07,713 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.98 vs. limit=22.5 2023-06-26 10:02:51,994 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.799e+02 5.933e+02 9.358e+02 1.476e+03 2.986e+03, threshold=1.872e+03, percent-clipped=34.0 2023-06-26 10:03:16,842 INFO [train.py:996] (0/4) Epoch 9, batch 13600, loss[loss=0.2003, simple_loss=0.2829, pruned_loss=0.05887, over 21414.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.3036, pruned_loss=0.07248, over 4274471.64 frames. ], batch size: 131, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 10:03:46,100 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1545402.0, ans=0.2 2023-06-26 10:03:49,543 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1545402.0, ans=0.125 2023-06-26 10:04:31,961 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1545522.0, ans=0.125 2023-06-26 10:04:59,329 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1545582.0, ans=0.0 2023-06-26 10:05:04,171 INFO [train.py:996] (0/4) Epoch 9, batch 13650, loss[loss=0.1772, simple_loss=0.2535, pruned_loss=0.05046, over 21632.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2977, pruned_loss=0.0697, over 4270449.37 frames. ], batch size: 332, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 10:05:19,507 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.85 vs. limit=15.0 2023-06-26 10:05:41,199 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.13 vs. 
limit=22.5 2023-06-26 10:05:47,690 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1545762.0, ans=0.1 2023-06-26 10:06:05,266 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1545762.0, ans=0.125 2023-06-26 10:06:23,708 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.401e+02 4.955e+02 6.723e+02 8.963e+02 2.035e+03, threshold=1.345e+03, percent-clipped=1.0 2023-06-26 10:06:48,903 INFO [train.py:996] (0/4) Epoch 9, batch 13700, loss[loss=0.1927, simple_loss=0.262, pruned_loss=0.06169, over 21604.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2924, pruned_loss=0.06982, over 4270849.62 frames. ], batch size: 263, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 10:06:49,538 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1545942.0, ans=0.125 2023-06-26 10:07:11,791 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.20 vs. limit=10.0 2023-06-26 10:07:18,936 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1546002.0, ans=0.125 2023-06-26 10:07:19,024 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1546002.0, ans=0.0 2023-06-26 10:07:53,879 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1546122.0, ans=0.1 2023-06-26 10:08:18,517 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1546182.0, ans=0.125 2023-06-26 10:08:45,473 INFO [train.py:996] (0/4) Epoch 9, batch 13750, loss[loss=0.2699, simple_loss=0.3438, pruned_loss=0.09805, over 21513.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2905, pruned_loss=0.06926, over 4271458.50 frames. ], batch size: 508, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:09:07,726 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1546302.0, ans=0.0 2023-06-26 10:09:20,456 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 10:09:26,247 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1546362.0, ans=0.0 2023-06-26 10:09:40,761 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1546362.0, ans=0.2 2023-06-26 10:10:09,373 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.59 vs. limit=22.5 2023-06-26 10:10:16,979 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.609e+02 6.154e+02 1.114e+03 1.508e+03 3.073e+03, threshold=2.228e+03, percent-clipped=34.0 2023-06-26 10:10:41,606 INFO [train.py:996] (0/4) Epoch 9, batch 13800, loss[loss=0.327, simple_loss=0.4188, pruned_loss=0.1176, over 21463.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2966, pruned_loss=0.06894, over 4273795.24 frames. 
], batch size: 507, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:12:04,412 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.16 vs. limit=15.0 2023-06-26 10:12:32,876 INFO [train.py:996] (0/4) Epoch 9, batch 13850, loss[loss=0.1956, simple_loss=0.2732, pruned_loss=0.05894, over 21861.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.3032, pruned_loss=0.07012, over 4272295.98 frames. ], batch size: 107, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:12:36,996 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1546842.0, ans=0.0 2023-06-26 10:12:43,981 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1546842.0, ans=0.0 2023-06-26 10:13:19,845 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=1546962.0, ans=0.05 2023-06-26 10:13:49,275 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1547022.0, ans=0.0 2023-06-26 10:13:49,902 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.68 vs. limit=22.5 2023-06-26 10:13:51,160 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1547022.0, ans=0.2 2023-06-26 10:13:57,542 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.808e+02 5.512e+02 9.000e+02 1.173e+03 2.021e+03, threshold=1.800e+03, percent-clipped=1.0 2023-06-26 10:14:22,463 INFO [train.py:996] (0/4) Epoch 9, batch 13900, loss[loss=0.2227, simple_loss=0.2981, pruned_loss=0.0737, over 21413.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.3075, pruned_loss=0.07285, over 4272073.54 frames. ], batch size: 211, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:14:46,299 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1547202.0, ans=0.1 2023-06-26 10:15:54,309 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1547382.0, ans=0.1 2023-06-26 10:16:11,141 INFO [train.py:996] (0/4) Epoch 9, batch 13950, loss[loss=0.2225, simple_loss=0.2928, pruned_loss=0.07611, over 21803.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.3063, pruned_loss=0.07438, over 4280134.15 frames. ], batch size: 298, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:16:26,515 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.68 vs. limit=15.0 2023-06-26 10:16:29,487 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1547502.0, ans=0.125 2023-06-26 10:16:52,249 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 10:17:15,570 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1547562.0, ans=0.5 2023-06-26 10:17:19,505 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.39 vs. 
limit=12.0 2023-06-26 10:17:34,993 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.697e+02 5.601e+02 7.890e+02 1.100e+03 2.147e+03, threshold=1.578e+03, percent-clipped=2.0 2023-06-26 10:17:50,068 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.27 vs. limit=15.0 2023-06-26 10:17:56,224 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1547682.0, ans=0.125 2023-06-26 10:17:58,872 INFO [train.py:996] (0/4) Epoch 9, batch 14000, loss[loss=0.1955, simple_loss=0.2941, pruned_loss=0.04846, over 21399.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.3037, pruned_loss=0.07266, over 4277004.40 frames. ], batch size: 548, lr: 3.29e-03, grad_scale: 32.0 2023-06-26 10:18:05,662 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.32 vs. limit=15.0 2023-06-26 10:18:19,405 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.02 vs. limit=6.0 2023-06-26 10:18:20,257 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1547802.0, ans=0.125 2023-06-26 10:18:39,739 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.35 vs. limit=15.0 2023-06-26 10:19:02,276 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1547862.0, ans=0.125 2023-06-26 10:19:22,532 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1547982.0, ans=0.1 2023-06-26 10:19:31,395 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 10:19:46,313 INFO [train.py:996] (0/4) Epoch 9, batch 14050, loss[loss=0.1774, simple_loss=0.2613, pruned_loss=0.04681, over 21404.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.298, pruned_loss=0.06807, over 4273293.53 frames. ], batch size: 211, lr: 3.29e-03, grad_scale: 32.0 2023-06-26 10:20:03,183 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.50 vs. limit=15.0 2023-06-26 10:20:27,075 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1548162.0, ans=0.0 2023-06-26 10:21:06,332 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.027e+02 4.796e+02 7.490e+02 1.046e+03 2.202e+03, threshold=1.498e+03, percent-clipped=4.0 2023-06-26 10:21:30,926 INFO [train.py:996] (0/4) Epoch 9, batch 14100, loss[loss=0.2182, simple_loss=0.2944, pruned_loss=0.07096, over 21695.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2909, pruned_loss=0.06732, over 4273187.09 frames. ], batch size: 332, lr: 3.29e-03, grad_scale: 32.0 2023-06-26 10:22:03,794 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.07 vs. 
limit=12.0 2023-06-26 10:22:08,148 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1548402.0, ans=0.0 2023-06-26 10:22:34,596 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1548462.0, ans=0.0 2023-06-26 10:23:18,199 INFO [train.py:996] (0/4) Epoch 9, batch 14150, loss[loss=0.2074, simple_loss=0.2991, pruned_loss=0.0579, over 21628.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2947, pruned_loss=0.06751, over 4259130.57 frames. ], batch size: 263, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:23:21,895 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1548642.0, ans=0.0 2023-06-26 10:23:27,289 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1548642.0, ans=0.125 2023-06-26 10:23:34,208 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1548702.0, ans=0.07 2023-06-26 10:23:47,568 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=1548702.0, ans=0.025 2023-06-26 10:24:24,187 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1548822.0, ans=0.125 2023-06-26 10:24:36,543 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1548822.0, ans=0.05 2023-06-26 10:24:40,164 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1548822.0, ans=0.0 2023-06-26 10:24:42,649 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.188e+02 5.825e+02 9.276e+02 1.325e+03 2.479e+03, threshold=1.855e+03, percent-clipped=15.0 2023-06-26 10:24:59,286 INFO [train.py:996] (0/4) Epoch 9, batch 14200, loss[loss=0.2292, simple_loss=0.3252, pruned_loss=0.06653, over 19903.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2947, pruned_loss=0.06687, over 4260254.74 frames. ], batch size: 702, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:26:04,144 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1549062.0, ans=0.2 2023-06-26 10:26:29,488 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff2.min_abs, batch_count=1549182.0, ans=0.1 2023-06-26 10:26:43,818 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1549182.0, ans=0.2 2023-06-26 10:26:47,090 INFO [train.py:996] (0/4) Epoch 9, batch 14250, loss[loss=0.2036, simple_loss=0.2776, pruned_loss=0.0648, over 21874.00 frames. ], tot_loss[loss=0.212, simple_loss=0.29, pruned_loss=0.06697, over 4269665.07 frames. 
], batch size: 98, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:27:07,882 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1549242.0, ans=0.125 2023-06-26 10:27:54,973 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1549362.0, ans=0.025 2023-06-26 10:28:22,816 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.193e+02 4.868e+02 6.668e+02 9.362e+02 2.470e+03, threshold=1.334e+03, percent-clipped=6.0 2023-06-26 10:28:43,657 INFO [train.py:996] (0/4) Epoch 9, batch 14300, loss[loss=0.2999, simple_loss=0.3981, pruned_loss=0.1009, over 21679.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.292, pruned_loss=0.06673, over 4261346.57 frames. ], batch size: 414, lr: 3.29e-03, grad_scale: 8.0 2023-06-26 10:28:46,557 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.38 vs. limit=15.0 2023-06-26 10:29:23,654 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1549662.0, ans=0.1 2023-06-26 10:30:08,135 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1549722.0, ans=0.125 2023-06-26 10:30:11,372 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1549782.0, ans=0.125 2023-06-26 10:30:33,268 INFO [train.py:996] (0/4) Epoch 9, batch 14350, loss[loss=0.1913, simple_loss=0.278, pruned_loss=0.05229, over 21032.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2974, pruned_loss=0.06717, over 4250474.76 frames. ], batch size: 608, lr: 3.29e-03, grad_scale: 8.0 2023-06-26 10:31:29,870 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1549962.0, ans=0.125 2023-06-26 10:31:29,995 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1549962.0, ans=0.125 2023-06-26 10:31:51,019 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1550022.0, ans=0.125 2023-06-26 10:32:00,441 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.420e+02 5.713e+02 8.636e+02 1.390e+03 3.076e+03, threshold=1.727e+03, percent-clipped=28.0 2023-06-26 10:32:21,206 INFO [train.py:996] (0/4) Epoch 9, batch 14400, loss[loss=0.1528, simple_loss=0.2103, pruned_loss=0.04766, over 16907.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2946, pruned_loss=0.06751, over 4248342.20 frames. ], batch size: 61, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:32:21,825 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1550142.0, ans=0.125 2023-06-26 10:32:35,266 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1550142.0, ans=0.125 2023-06-26 10:33:32,634 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1550322.0, ans=0.125 2023-06-26 10:33:38,766 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.07 vs. 
limit=10.0 2023-06-26 10:34:03,155 INFO [train.py:996] (0/4) Epoch 9, batch 14450, loss[loss=0.2058, simple_loss=0.2805, pruned_loss=0.06553, over 21799.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.2889, pruned_loss=0.06732, over 4253537.41 frames. ], batch size: 112, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:34:34,794 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1550502.0, ans=0.95 2023-06-26 10:34:36,267 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1550502.0, ans=0.0 2023-06-26 10:35:15,205 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1550622.0, ans=0.0 2023-06-26 10:35:36,886 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.305e+02 4.619e+02 5.727e+02 8.380e+02 1.480e+03, threshold=1.145e+03, percent-clipped=0.0 2023-06-26 10:35:43,932 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1550682.0, ans=0.2 2023-06-26 10:35:56,853 INFO [train.py:996] (0/4) Epoch 9, batch 14500, loss[loss=0.2084, simple_loss=0.3034, pruned_loss=0.05665, over 21264.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.2848, pruned_loss=0.06704, over 4262069.51 frames. ], batch size: 176, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:36:00,802 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1550742.0, ans=0.0 2023-06-26 10:36:09,606 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1550742.0, ans=0.125 2023-06-26 10:36:16,530 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1550802.0, ans=0.0 2023-06-26 10:36:41,069 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff3.min_abs, batch_count=1550862.0, ans=0.2 2023-06-26 10:37:01,297 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.79 vs. limit=15.0 2023-06-26 10:37:15,097 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1550922.0, ans=0.125 2023-06-26 10:37:17,369 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.48 vs. limit=15.0 2023-06-26 10:37:46,730 INFO [train.py:996] (0/4) Epoch 9, batch 14550, loss[loss=0.3158, simple_loss=0.3725, pruned_loss=0.1296, over 21324.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2893, pruned_loss=0.06879, over 4265561.39 frames. ], batch size: 507, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:39:20,550 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.787e+02 5.550e+02 7.546e+02 1.212e+03 2.573e+03, threshold=1.509e+03, percent-clipped=29.0 2023-06-26 10:39:35,749 INFO [train.py:996] (0/4) Epoch 9, batch 14600, loss[loss=0.2283, simple_loss=0.303, pruned_loss=0.07684, over 21607.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.2979, pruned_loss=0.07252, over 4270212.07 frames. 
], batch size: 263, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:40:28,602 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1551462.0, ans=0.125 2023-06-26 10:41:24,127 INFO [train.py:996] (0/4) Epoch 9, batch 14650, loss[loss=0.2185, simple_loss=0.2944, pruned_loss=0.07131, over 20112.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.3001, pruned_loss=0.07144, over 4254439.39 frames. ], batch size: 702, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:42:04,495 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1551762.0, ans=0.125 2023-06-26 10:42:13,396 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1551762.0, ans=0.125 2023-06-26 10:42:22,303 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1551762.0, ans=0.1 2023-06-26 10:42:36,329 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1551822.0, ans=0.1 2023-06-26 10:42:36,357 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1551822.0, ans=0.0 2023-06-26 10:42:46,543 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.116e+02 4.456e+02 7.843e+02 1.118e+03 1.924e+03, threshold=1.569e+03, percent-clipped=10.0 2023-06-26 10:43:07,372 INFO [train.py:996] (0/4) Epoch 9, batch 14700, loss[loss=0.2072, simple_loss=0.3031, pruned_loss=0.05562, over 21776.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2951, pruned_loss=0.06643, over 4256517.79 frames. ], batch size: 282, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:43:09,615 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1551942.0, ans=0.125 2023-06-26 10:43:36,485 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1552002.0, ans=0.0 2023-06-26 10:44:15,899 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1552122.0, ans=0.04949747468305833 2023-06-26 10:44:21,523 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1552122.0, ans=0.0 2023-06-26 10:44:54,769 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.30 vs. limit=15.0 2023-06-26 10:44:58,837 INFO [train.py:996] (0/4) Epoch 9, batch 14750, loss[loss=0.2431, simple_loss=0.3285, pruned_loss=0.07882, over 21487.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2972, pruned_loss=0.06809, over 4260836.47 frames. ], batch size: 131, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:45:11,178 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1552242.0, ans=0.04949747468305833 2023-06-26 10:45:32,785 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1552302.0, ans=0.0 2023-06-26 10:45:42,648 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.08 vs. 
limit=22.5 2023-06-26 10:46:34,198 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.559e+02 5.782e+02 7.997e+02 1.225e+03 2.854e+03, threshold=1.599e+03, percent-clipped=14.0 2023-06-26 10:46:53,298 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.25 vs. limit=22.5 2023-06-26 10:46:55,533 INFO [train.py:996] (0/4) Epoch 9, batch 14800, loss[loss=0.2322, simple_loss=0.3092, pruned_loss=0.07766, over 21781.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3095, pruned_loss=0.07456, over 4260757.06 frames. ], batch size: 351, lr: 3.29e-03, grad_scale: 32.0 2023-06-26 10:47:05,562 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1552542.0, ans=0.125 2023-06-26 10:47:05,604 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1552542.0, ans=0.125 2023-06-26 10:47:40,987 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1552662.0, ans=0.0 2023-06-26 10:48:26,396 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1552722.0, ans=0.2 2023-06-26 10:48:59,080 INFO [train.py:996] (0/4) Epoch 9, batch 14850, loss[loss=0.2883, simple_loss=0.3599, pruned_loss=0.1084, over 21412.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3034, pruned_loss=0.07382, over 4263991.67 frames. ], batch size: 471, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:49:10,239 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1552842.0, ans=0.015 2023-06-26 10:49:55,992 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.10 vs. limit=6.0 2023-06-26 10:50:23,891 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1553022.0, ans=0.0 2023-06-26 10:50:35,327 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.505e+02 5.144e+02 7.174e+02 1.026e+03 2.687e+03, threshold=1.435e+03, percent-clipped=5.0 2023-06-26 10:50:50,338 INFO [train.py:996] (0/4) Epoch 9, batch 14900, loss[loss=0.2719, simple_loss=0.3433, pruned_loss=0.1003, over 21270.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3056, pruned_loss=0.07499, over 4266669.26 frames. ], batch size: 143, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:51:27,063 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1553202.0, ans=0.0 2023-06-26 10:51:27,679 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.67 vs. limit=15.0 2023-06-26 10:51:42,264 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.43 vs. 
limit=15.0 2023-06-26 10:51:43,520 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1553262.0, ans=0.1 2023-06-26 10:51:50,196 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1553262.0, ans=0.125 2023-06-26 10:52:03,228 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1553322.0, ans=0.125 2023-06-26 10:52:42,198 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.33 vs. limit=22.5 2023-06-26 10:52:46,127 INFO [train.py:996] (0/4) Epoch 9, batch 14950, loss[loss=0.2125, simple_loss=0.3093, pruned_loss=0.05786, over 21234.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.3058, pruned_loss=0.07364, over 4262201.77 frames. ], batch size: 549, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:52:58,907 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1553442.0, ans=0.125 2023-06-26 10:53:10,253 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.89 vs. limit=6.0 2023-06-26 10:54:17,657 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.700e+02 5.284e+02 7.127e+02 1.003e+03 2.591e+03, threshold=1.425e+03, percent-clipped=12.0 2023-06-26 10:54:37,167 INFO [train.py:996] (0/4) Epoch 9, batch 15000, loss[loss=0.2157, simple_loss=0.2851, pruned_loss=0.07316, over 21325.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3093, pruned_loss=0.0758, over 4263308.40 frames. ], batch size: 159, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:54:37,168 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-26 10:54:55,449 INFO [train.py:1028] (0/4) Epoch 9, validation: loss=0.2558, simple_loss=0.3464, pruned_loss=0.08259, over 1796401.00 frames. 2023-06-26 10:54:55,450 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23714MB 2023-06-26 10:55:19,338 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1553802.0, ans=0.0 2023-06-26 10:55:30,933 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1553802.0, ans=0.0 2023-06-26 10:55:38,979 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1553802.0, ans=0.125 2023-06-26 10:56:07,433 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1553922.0, ans=0.125 2023-06-26 10:56:13,039 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1553922.0, ans=0.0 2023-06-26 10:56:46,884 INFO [train.py:996] (0/4) Epoch 9, batch 15050, loss[loss=0.2319, simple_loss=0.334, pruned_loss=0.06493, over 20701.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3102, pruned_loss=0.07618, over 4262221.99 frames. 
], batch size: 607, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:57:22,367 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1554102.0, ans=0.125 2023-06-26 10:57:25,956 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1554102.0, ans=0.1 2023-06-26 10:58:08,549 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1554222.0, ans=0.07 2023-06-26 10:58:13,742 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1554222.0, ans=0.0 2023-06-26 10:58:21,872 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.687e+02 6.658e+02 1.222e+03 1.555e+03 2.780e+03, threshold=2.443e+03, percent-clipped=32.0 2023-06-26 10:58:41,257 INFO [train.py:996] (0/4) Epoch 9, batch 15100, loss[loss=0.2311, simple_loss=0.3029, pruned_loss=0.07967, over 21813.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.312, pruned_loss=0.07563, over 4262420.14 frames. ], batch size: 247, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:58:52,140 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1554342.0, ans=0.0 2023-06-26 10:59:22,764 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1554402.0, ans=0.1 2023-06-26 10:59:26,329 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 10:59:38,464 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1554462.0, ans=0.125 2023-06-26 11:00:29,604 INFO [train.py:996] (0/4) Epoch 9, batch 15150, loss[loss=0.2194, simple_loss=0.2797, pruned_loss=0.07954, over 21593.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3077, pruned_loss=0.07569, over 4267448.91 frames. ], batch size: 415, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 11:00:52,476 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1554702.0, ans=0.0 2023-06-26 11:01:45,671 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.18 vs. limit=15.0 2023-06-26 11:01:55,367 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1554882.0, ans=0.125 2023-06-26 11:02:05,233 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.559e+02 4.649e+02 7.475e+02 1.057e+03 2.217e+03, threshold=1.495e+03, percent-clipped=0.0 2023-06-26 11:02:19,239 INFO [train.py:996] (0/4) Epoch 9, batch 15200, loss[loss=0.1983, simple_loss=0.2925, pruned_loss=0.05209, over 21679.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.3004, pruned_loss=0.07166, over 4260929.35 frames. ], batch size: 415, lr: 3.29e-03, grad_scale: 32.0 2023-06-26 11:02:36,347 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.28 vs. 
limit=22.5 2023-06-26 11:02:58,880 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1555002.0, ans=0.0 2023-06-26 11:03:02,434 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1555002.0, ans=0.125 2023-06-26 11:04:04,151 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1555182.0, ans=0.125 2023-06-26 11:04:12,988 INFO [train.py:996] (0/4) Epoch 9, batch 15250, loss[loss=0.1877, simple_loss=0.2488, pruned_loss=0.06331, over 21280.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2939, pruned_loss=0.07023, over 4259113.01 frames. ], batch size: 551, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 11:05:43,871 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.23 vs. limit=15.0 2023-06-26 11:05:44,372 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.438e+02 5.756e+02 7.926e+02 1.187e+03 2.967e+03, threshold=1.585e+03, percent-clipped=10.0 2023-06-26 11:06:02,508 INFO [train.py:996] (0/4) Epoch 9, batch 15300, loss[loss=0.2471, simple_loss=0.317, pruned_loss=0.08857, over 21444.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.2962, pruned_loss=0.07272, over 4254557.87 frames. ], batch size: 194, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 11:06:23,379 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1555542.0, ans=0.0 2023-06-26 11:06:25,609 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=1555602.0, ans=15.0 2023-06-26 11:06:42,004 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.90 vs. limit=15.0 2023-06-26 11:07:05,839 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1555722.0, ans=0.0 2023-06-26 11:07:52,677 INFO [train.py:996] (0/4) Epoch 9, batch 15350, loss[loss=0.2213, simple_loss=0.3224, pruned_loss=0.06008, over 21844.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.2992, pruned_loss=0.07365, over 4258988.80 frames. ], batch size: 371, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:08:06,114 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.12 vs. limit=15.0 2023-06-26 11:08:58,651 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1556022.0, ans=0.125 2023-06-26 11:08:59,243 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.28 vs. limit=22.5 2023-06-26 11:09:22,276 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.533e+02 5.256e+02 7.334e+02 1.092e+03 2.120e+03, threshold=1.467e+03, percent-clipped=2.0 2023-06-26 11:09:32,324 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.35 vs. limit=15.0 2023-06-26 11:09:39,833 INFO [train.py:996] (0/4) Epoch 9, batch 15400, loss[loss=0.2126, simple_loss=0.2918, pruned_loss=0.06663, over 21867.00 frames. 
], tot_loss[loss=0.2216, simple_loss=0.2994, pruned_loss=0.07186, over 4266264.68 frames. ], batch size: 371, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:09:44,693 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.69 vs. limit=10.0 2023-06-26 11:09:56,860 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.64 vs. limit=6.0 2023-06-26 11:10:22,824 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.71 vs. limit=15.0 2023-06-26 11:10:56,787 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1556322.0, ans=0.125 2023-06-26 11:11:18,798 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1556382.0, ans=0.125 2023-06-26 11:11:23,608 INFO [train.py:996] (0/4) Epoch 9, batch 15450, loss[loss=0.2035, simple_loss=0.2949, pruned_loss=0.05604, over 21848.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2971, pruned_loss=0.07069, over 4261180.86 frames. ], batch size: 351, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:12:02,230 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 11:12:42,712 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1556622.0, ans=0.2 2023-06-26 11:12:59,146 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1556682.0, ans=0.1 2023-06-26 11:13:01,353 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.52 vs. limit=15.0 2023-06-26 11:13:01,733 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.285e+02 4.674e+02 6.020e+02 7.889e+02 1.710e+03, threshold=1.204e+03, percent-clipped=2.0 2023-06-26 11:13:20,022 INFO [train.py:996] (0/4) Epoch 9, batch 15500, loss[loss=0.2474, simple_loss=0.3271, pruned_loss=0.08385, over 21281.00 frames. ], tot_loss[loss=0.221, simple_loss=0.3003, pruned_loss=0.07086, over 4250945.75 frames. ], batch size: 159, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:13:21,292 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.38 vs. limit=15.0 2023-06-26 11:13:21,339 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.03 vs. 
limit=15.0 2023-06-26 11:13:24,453 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1556742.0, ans=0.125 2023-06-26 11:14:02,193 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1556862.0, ans=0.2 2023-06-26 11:14:02,219 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1556862.0, ans=0.1 2023-06-26 11:14:36,762 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1556922.0, ans=0.1 2023-06-26 11:14:38,906 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1556922.0, ans=0.0 2023-06-26 11:15:11,427 INFO [train.py:996] (0/4) Epoch 9, batch 15550, loss[loss=0.1992, simple_loss=0.2826, pruned_loss=0.05787, over 21708.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2998, pruned_loss=0.06921, over 4255057.40 frames. ], batch size: 332, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:15:22,704 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 11:15:23,224 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.45 vs. limit=15.0 2023-06-26 11:15:33,277 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.86 vs. limit=6.0 2023-06-26 11:15:34,982 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.13 vs. limit=22.5 2023-06-26 11:16:41,914 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.382e+02 5.062e+02 7.091e+02 1.054e+03 2.391e+03, threshold=1.418e+03, percent-clipped=18.0 2023-06-26 11:16:53,056 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.95 vs. limit=15.0 2023-06-26 11:16:55,601 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1557282.0, ans=0.0 2023-06-26 11:16:59,946 INFO [train.py:996] (0/4) Epoch 9, batch 15600, loss[loss=0.1909, simple_loss=0.2773, pruned_loss=0.05224, over 21663.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2934, pruned_loss=0.06786, over 4255288.38 frames. ], batch size: 247, lr: 3.28e-03, grad_scale: 32.0 2023-06-26 11:17:04,432 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1557342.0, ans=0.0 2023-06-26 11:17:07,648 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1557342.0, ans=0.5 2023-06-26 11:17:28,744 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1557402.0, ans=0.0 2023-06-26 11:17:39,921 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.78 vs. 
limit=15.0 2023-06-26 11:18:05,005 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1557522.0, ans=0.2 2023-06-26 11:18:05,643 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.69 vs. limit=10.0 2023-06-26 11:18:48,390 INFO [train.py:996] (0/4) Epoch 9, batch 15650, loss[loss=0.2049, simple_loss=0.2682, pruned_loss=0.07078, over 21726.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2931, pruned_loss=0.06791, over 4257605.66 frames. ], batch size: 371, lr: 3.28e-03, grad_scale: 32.0 2023-06-26 11:19:08,711 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1557642.0, ans=0.0 2023-06-26 11:19:15,300 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1557702.0, ans=0.0 2023-06-26 11:20:17,631 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1557822.0, ans=0.0 2023-06-26 11:20:19,683 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 11:20:25,560 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.281e+02 4.437e+02 5.415e+02 7.572e+02 1.667e+03, threshold=1.083e+03, percent-clipped=3.0 2023-06-26 11:20:43,533 INFO [train.py:996] (0/4) Epoch 9, batch 15700, loss[loss=0.1861, simple_loss=0.2632, pruned_loss=0.05449, over 21645.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2894, pruned_loss=0.06738, over 4252729.80 frames. ], batch size: 263, lr: 3.28e-03, grad_scale: 32.0 2023-06-26 11:20:59,483 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1558002.0, ans=0.1 2023-06-26 11:22:30,894 INFO [train.py:996] (0/4) Epoch 9, batch 15750, loss[loss=0.2082, simple_loss=0.2775, pruned_loss=0.06943, over 21457.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2853, pruned_loss=0.06679, over 4241957.34 frames. ], batch size: 389, lr: 3.28e-03, grad_scale: 32.0 2023-06-26 11:23:04,743 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1558362.0, ans=0.2 2023-06-26 11:23:43,526 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1558422.0, ans=0.125 2023-06-26 11:24:01,250 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.153e+02 4.399e+02 6.641e+02 9.028e+02 1.552e+03, threshold=1.328e+03, percent-clipped=11.0 2023-06-26 11:24:03,857 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff2.min_abs, batch_count=1558482.0, ans=0.1 2023-06-26 11:24:18,401 INFO [train.py:996] (0/4) Epoch 9, batch 15800, loss[loss=0.1893, simple_loss=0.2542, pruned_loss=0.06217, over 21489.00 frames. ], tot_loss[loss=0.207, simple_loss=0.2812, pruned_loss=0.06642, over 4239657.96 frames. ], batch size: 212, lr: 3.28e-03, grad_scale: 32.0 2023-06-26 11:24:40,683 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.14 vs. 
limit=22.5 2023-06-26 11:24:45,376 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1558602.0, ans=0.125 2023-06-26 11:25:10,209 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1558662.0, ans=0.125 2023-06-26 11:25:20,042 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1558722.0, ans=0.035 2023-06-26 11:25:35,903 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1558722.0, ans=0.125 2023-06-26 11:26:06,290 INFO [train.py:996] (0/4) Epoch 9, batch 15850, loss[loss=0.2217, simple_loss=0.2984, pruned_loss=0.07249, over 21959.00 frames. ], tot_loss[loss=0.2091, simple_loss=0.2822, pruned_loss=0.06799, over 4251955.60 frames. ], batch size: 317, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:26:06,894 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1558842.0, ans=0.125 2023-06-26 11:27:38,953 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.367e+02 5.055e+02 6.778e+02 9.936e+02 2.216e+03, threshold=1.356e+03, percent-clipped=9.0 2023-06-26 11:27:49,539 INFO [train.py:996] (0/4) Epoch 9, batch 15900, loss[loss=0.2094, simple_loss=0.2971, pruned_loss=0.06088, over 21653.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2803, pruned_loss=0.06772, over 4238176.43 frames. ], batch size: 298, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:28:18,389 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1559202.0, ans=10.0 2023-06-26 11:28:23,589 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1559202.0, ans=0.125 2023-06-26 11:29:36,124 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1559382.0, ans=0.05 2023-06-26 11:29:38,929 INFO [train.py:996] (0/4) Epoch 9, batch 15950, loss[loss=0.1678, simple_loss=0.2675, pruned_loss=0.03402, over 21616.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2817, pruned_loss=0.0664, over 4233238.35 frames. ], batch size: 263, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:29:41,205 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 11:29:41,356 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1559442.0, ans=0.0 2023-06-26 11:30:39,033 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=6.79 vs. limit=22.5 2023-06-26 11:31:17,727 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.330e+02 4.974e+02 7.474e+02 9.810e+02 2.700e+03, threshold=1.495e+03, percent-clipped=8.0 2023-06-26 11:31:28,107 INFO [train.py:996] (0/4) Epoch 9, batch 16000, loss[loss=0.2574, simple_loss=0.3324, pruned_loss=0.09118, over 21585.00 frames. ], tot_loss[loss=0.207, simple_loss=0.2831, pruned_loss=0.0655, over 4245520.73 frames. 
], batch size: 471, lr: 3.28e-03, grad_scale: 32.0 2023-06-26 11:31:30,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1559742.0, ans=0.125 2023-06-26 11:31:57,408 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.84 vs. limit=22.5 2023-06-26 11:32:00,408 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1559802.0, ans=0.125 2023-06-26 11:32:11,301 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1559862.0, ans=0.125 2023-06-26 11:32:23,855 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1559862.0, ans=0.0 2023-06-26 11:32:35,935 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1559922.0, ans=0.0 2023-06-26 11:32:53,300 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-260000.pt 2023-06-26 11:33:14,623 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1559982.0, ans=0.0 2023-06-26 11:33:17,709 INFO [train.py:996] (0/4) Epoch 9, batch 16050, loss[loss=0.3303, simple_loss=0.453, pruned_loss=0.1039, over 19780.00 frames. ], tot_loss[loss=0.2057, simple_loss=0.2853, pruned_loss=0.06299, over 4257646.40 frames. ], batch size: 702, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:33:46,070 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.01 vs. limit=22.5 2023-06-26 11:34:45,488 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.405e+02 5.738e+02 8.747e+02 1.434e+03 3.009e+03, threshold=1.749e+03, percent-clipped=21.0 2023-06-26 11:35:00,964 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1560282.0, ans=0.1 2023-06-26 11:35:05,355 INFO [train.py:996] (0/4) Epoch 9, batch 16100, loss[loss=0.2282, simple_loss=0.2938, pruned_loss=0.08129, over 21541.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2913, pruned_loss=0.06496, over 4271716.39 frames. ], batch size: 548, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:35:23,038 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1560342.0, ans=0.2 2023-06-26 11:35:27,091 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.50 vs. limit=6.0 2023-06-26 11:36:48,172 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1560582.0, ans=0.125 2023-06-26 11:36:54,149 INFO [train.py:996] (0/4) Epoch 9, batch 16150, loss[loss=0.2178, simple_loss=0.286, pruned_loss=0.07475, over 21931.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2922, pruned_loss=0.06605, over 4279433.31 frames. 
], batch size: 351, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:37:13,262 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1560642.0, ans=0.125 2023-06-26 11:37:39,002 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1560762.0, ans=0.015 2023-06-26 11:37:57,298 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1560822.0, ans=0.1 2023-06-26 11:38:19,764 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1560882.0, ans=0.125 2023-06-26 11:38:22,976 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1560882.0, ans=0.125 2023-06-26 11:38:33,307 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.369e+02 5.523e+02 8.339e+02 1.289e+03 2.279e+03, threshold=1.668e+03, percent-clipped=10.0 2023-06-26 11:38:46,827 INFO [train.py:996] (0/4) Epoch 9, batch 16200, loss[loss=0.223, simple_loss=0.313, pruned_loss=0.06647, over 21456.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2964, pruned_loss=0.06782, over 4289190.73 frames. ], batch size: 211, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:40:12,409 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1561122.0, ans=0.125 2023-06-26 11:40:25,159 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1561182.0, ans=0.125 2023-06-26 11:40:27,434 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.57 vs. limit=15.0 2023-06-26 11:40:38,365 INFO [train.py:996] (0/4) Epoch 9, batch 16250, loss[loss=0.2234, simple_loss=0.304, pruned_loss=0.07138, over 21331.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2971, pruned_loss=0.06838, over 4290245.68 frames. ], batch size: 549, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:40:55,701 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1561302.0, ans=0.125 2023-06-26 11:42:17,695 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.175e+02 4.964e+02 6.149e+02 9.832e+02 2.311e+03, threshold=1.230e+03, percent-clipped=3.0 2023-06-26 11:42:26,975 INFO [train.py:996] (0/4) Epoch 9, batch 16300, loss[loss=0.168, simple_loss=0.2384, pruned_loss=0.04884, over 21820.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2912, pruned_loss=0.06502, over 4278024.27 frames. ], batch size: 107, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:43:40,381 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1561722.0, ans=0.0 2023-06-26 11:44:17,115 INFO [train.py:996] (0/4) Epoch 9, batch 16350, loss[loss=0.2415, simple_loss=0.3199, pruned_loss=0.08149, over 21902.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2927, pruned_loss=0.06665, over 4276367.62 frames. 
], batch size: 372, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:44:17,722 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1561842.0, ans=0.125 2023-06-26 11:44:19,444 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1561842.0, ans=0.1 2023-06-26 11:44:28,641 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1561842.0, ans=0.125 2023-06-26 11:44:28,645 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1561842.0, ans=0.04949747468305833 2023-06-26 11:44:30,533 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1561842.0, ans=0.125 2023-06-26 11:44:44,623 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1561902.0, ans=0.125 2023-06-26 11:45:11,929 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1561962.0, ans=0.125 2023-06-26 11:45:12,016 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1561962.0, ans=0.0 2023-06-26 11:45:12,019 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1561962.0, ans=0.0 2023-06-26 11:45:56,602 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.631e+02 4.777e+02 5.847e+02 7.634e+02 1.657e+03, threshold=1.169e+03, percent-clipped=4.0 2023-06-26 11:46:05,012 INFO [train.py:996] (0/4) Epoch 9, batch 16400, loss[loss=0.2702, simple_loss=0.3247, pruned_loss=0.1079, over 21703.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2936, pruned_loss=0.0673, over 4262673.24 frames. ], batch size: 507, lr: 3.28e-03, grad_scale: 32.0 2023-06-26 11:46:09,494 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 11:47:41,328 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.86 vs. limit=12.0 2023-06-26 11:47:47,939 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=18.50 vs. limit=15.0 2023-06-26 11:47:54,202 INFO [train.py:996] (0/4) Epoch 9, batch 16450, loss[loss=0.2197, simple_loss=0.3001, pruned_loss=0.0696, over 21553.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2952, pruned_loss=0.06886, over 4269275.25 frames. ], batch size: 131, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:48:53,063 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1562562.0, ans=0.0 2023-06-26 11:49:27,571 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.14 vs. limit=12.0 2023-06-26 11:49:36,734 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.419e+02 4.948e+02 6.287e+02 8.709e+02 1.538e+03, threshold=1.257e+03, percent-clipped=9.0 2023-06-26 11:49:44,351 INFO [train.py:996] (0/4) Epoch 9, batch 16500, loss[loss=0.1879, simple_loss=0.2474, pruned_loss=0.06421, over 21257.00 frames. 
], tot_loss[loss=0.2165, simple_loss=0.294, pruned_loss=0.06951, over 4275775.19 frames. ], batch size: 176, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:50:21,007 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=1562802.0, ans=10.0 2023-06-26 11:50:22,759 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1562802.0, ans=0.0 2023-06-26 11:50:22,901 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1562802.0, ans=0.0 2023-06-26 11:50:24,605 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1562862.0, ans=0.1 2023-06-26 11:50:44,117 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1562862.0, ans=0.1 2023-06-26 11:51:34,685 INFO [train.py:996] (0/4) Epoch 9, batch 16550, loss[loss=0.2265, simple_loss=0.3131, pruned_loss=0.0699, over 21630.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2964, pruned_loss=0.06915, over 4261020.51 frames. ], batch size: 414, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:51:54,435 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.28 vs. limit=22.5 2023-06-26 11:52:08,410 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1563102.0, ans=0.0 2023-06-26 11:53:13,721 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.12 vs. limit=15.0 2023-06-26 11:53:24,965 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.329e+02 6.129e+02 9.986e+02 1.624e+03 3.562e+03, threshold=1.997e+03, percent-clipped=34.0 2023-06-26 11:53:31,910 INFO [train.py:996] (0/4) Epoch 9, batch 16600, loss[loss=0.2621, simple_loss=0.3786, pruned_loss=0.07278, over 19732.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.3046, pruned_loss=0.07179, over 4260234.21 frames. ], batch size: 702, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:53:57,933 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1563342.0, ans=0.125 2023-06-26 11:54:39,163 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1563462.0, ans=0.0 2023-06-26 11:55:10,008 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1563582.0, ans=0.95 2023-06-26 11:55:15,625 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.91 vs. limit=15.0 2023-06-26 11:55:29,111 INFO [train.py:996] (0/4) Epoch 9, batch 16650, loss[loss=0.24, simple_loss=0.3188, pruned_loss=0.08055, over 21787.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3136, pruned_loss=0.07446, over 4269029.58 frames. 
], batch size: 298, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:56:09,568 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1563702.0, ans=0.125 2023-06-26 11:56:11,070 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1563702.0, ans=0.125 2023-06-26 11:56:25,770 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1563762.0, ans=0.0 2023-06-26 11:56:40,570 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1563822.0, ans=0.125 2023-06-26 11:57:05,036 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.69 vs. limit=12.0 2023-06-26 11:57:21,322 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.537e+02 4.947e+02 6.891e+02 9.517e+02 1.890e+03, threshold=1.378e+03, percent-clipped=0.0 2023-06-26 11:57:31,275 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.33 vs. limit=15.0 2023-06-26 11:57:33,720 INFO [train.py:996] (0/4) Epoch 9, batch 16700, loss[loss=0.2158, simple_loss=0.2953, pruned_loss=0.06814, over 20714.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3138, pruned_loss=0.07493, over 4267515.00 frames. ], batch size: 607, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:57:53,458 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.52 vs. limit=15.0 2023-06-26 11:58:16,686 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1564062.0, ans=0.1 2023-06-26 11:58:29,917 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1564062.0, ans=0.0 2023-06-26 11:59:22,895 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.29 vs. limit=12.0 2023-06-26 11:59:29,011 INFO [train.py:996] (0/4) Epoch 9, batch 16750, loss[loss=0.2471, simple_loss=0.3261, pruned_loss=0.0841, over 21787.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3149, pruned_loss=0.07652, over 4262734.00 frames. ], batch size: 124, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:59:43,230 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1564242.0, ans=0.125 2023-06-26 11:59:56,635 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1564302.0, ans=0.1 2023-06-26 12:00:31,312 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.89 vs. 
limit=15.0 2023-06-26 12:01:05,541 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1564482.0, ans=0.05 2023-06-26 12:01:09,400 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1564482.0, ans=0.2 2023-06-26 12:01:13,734 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.747e+02 5.518e+02 7.489e+02 1.102e+03 1.868e+03, threshold=1.498e+03, percent-clipped=9.0 2023-06-26 12:01:17,725 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=1564482.0, ans=0.05 2023-06-26 12:01:20,315 INFO [train.py:996] (0/4) Epoch 9, batch 16800, loss[loss=0.2545, simple_loss=0.3311, pruned_loss=0.08896, over 21597.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3192, pruned_loss=0.07672, over 4262398.34 frames. ], batch size: 471, lr: 3.28e-03, grad_scale: 32.0 2023-06-26 12:02:28,418 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.83 vs. limit=15.0 2023-06-26 12:03:09,576 INFO [train.py:996] (0/4) Epoch 9, batch 16850, loss[loss=0.2041, simple_loss=0.272, pruned_loss=0.06807, over 21565.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3138, pruned_loss=0.07616, over 4267515.80 frames. ], batch size: 195, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 12:03:21,485 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.09 vs. limit=12.0 2023-06-26 12:03:39,690 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1564902.0, ans=0.125 2023-06-26 12:04:07,276 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.27 vs. limit=6.0 2023-06-26 12:04:16,771 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1564962.0, ans=0.1 2023-06-26 12:04:29,138 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1565022.0, ans=0.125 2023-06-26 12:04:35,654 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1565022.0, ans=0.125 2023-06-26 12:04:52,108 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.430e+02 5.138e+02 7.609e+02 1.062e+03 2.399e+03, threshold=1.522e+03, percent-clipped=7.0 2023-06-26 12:05:02,265 INFO [train.py:996] (0/4) Epoch 9, batch 16900, loss[loss=0.1852, simple_loss=0.2621, pruned_loss=0.0541, over 21654.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3093, pruned_loss=0.07439, over 4271051.81 frames. ], batch size: 332, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:06:20,537 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.60 vs. limit=15.0 2023-06-26 12:06:43,802 INFO [train.py:996] (0/4) Epoch 9, batch 16950, loss[loss=0.1923, simple_loss=0.2657, pruned_loss=0.0595, over 21823.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.3012, pruned_loss=0.07237, over 4268938.41 frames. 
], batch size: 282, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:08:09,811 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1565622.0, ans=0.125 2023-06-26 12:08:21,652 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.76 vs. limit=15.0 2023-06-26 12:08:23,165 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.28 vs. limit=6.0 2023-06-26 12:08:27,260 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.850e+02 5.163e+02 6.810e+02 8.799e+02 2.047e+03, threshold=1.362e+03, percent-clipped=3.0 2023-06-26 12:08:32,648 INFO [train.py:996] (0/4) Epoch 9, batch 17000, loss[loss=0.1953, simple_loss=0.2675, pruned_loss=0.06156, over 21681.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.2978, pruned_loss=0.0726, over 4274709.47 frames. ], batch size: 263, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:09:04,533 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1565802.0, ans=0.125 2023-06-26 12:09:59,256 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1565922.0, ans=0.05 2023-06-26 12:10:29,873 INFO [train.py:996] (0/4) Epoch 9, batch 17050, loss[loss=0.1875, simple_loss=0.2486, pruned_loss=0.06319, over 20231.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3043, pruned_loss=0.07462, over 4281211.36 frames. ], batch size: 703, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:11:05,417 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1566102.0, ans=0.1 2023-06-26 12:12:02,442 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1566282.0, ans=0.2 2023-06-26 12:12:06,992 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.631e+02 5.718e+02 8.769e+02 1.372e+03 2.605e+03, threshold=1.754e+03, percent-clipped=26.0 2023-06-26 12:12:17,823 INFO [train.py:996] (0/4) Epoch 9, batch 17100, loss[loss=0.1939, simple_loss=0.2645, pruned_loss=0.0616, over 21685.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3021, pruned_loss=0.07448, over 4291557.30 frames. ], batch size: 263, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:12:56,990 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1566402.0, ans=0.2 2023-06-26 12:13:28,545 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1566522.0, ans=0.1 2023-06-26 12:13:31,704 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1566522.0, ans=0.125 2023-06-26 12:13:36,862 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1566522.0, ans=0.125 2023-06-26 12:13:51,544 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.28 vs. limit=22.5 2023-06-26 12:14:10,890 INFO [train.py:996] (0/4) Epoch 9, batch 17150, loss[loss=0.1992, simple_loss=0.268, pruned_loss=0.06519, over 21395.00 frames. 
], tot_loss[loss=0.223, simple_loss=0.299, pruned_loss=0.07349, over 4284268.89 frames. ], batch size: 176, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:14:34,428 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1566702.0, ans=0.125 2023-06-26 12:15:01,244 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1566762.0, ans=0.5 2023-06-26 12:15:55,039 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.574e+02 4.824e+02 6.813e+02 1.101e+03 2.342e+03, threshold=1.363e+03, percent-clipped=2.0 2023-06-26 12:15:59,570 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1566942.0, ans=0.025 2023-06-26 12:16:00,485 INFO [train.py:996] (0/4) Epoch 9, batch 17200, loss[loss=0.2103, simple_loss=0.2904, pruned_loss=0.06507, over 21739.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2985, pruned_loss=0.07271, over 4284348.83 frames. ], batch size: 298, lr: 3.27e-03, grad_scale: 32.0 2023-06-26 12:16:33,031 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1567002.0, ans=0.125 2023-06-26 12:16:36,605 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1567002.0, ans=0.0 2023-06-26 12:16:46,249 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.61 vs. limit=15.0 2023-06-26 12:17:08,605 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1567122.0, ans=0.2 2023-06-26 12:18:02,437 INFO [train.py:996] (0/4) Epoch 9, batch 17250, loss[loss=0.2685, simple_loss=0.3358, pruned_loss=0.1006, over 21805.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.3014, pruned_loss=0.07438, over 4280141.30 frames. ], batch size: 441, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:18:10,632 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.77 vs. limit=15.0 2023-06-26 12:18:15,747 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1567242.0, ans=0.2 2023-06-26 12:18:19,810 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.80 vs. limit=22.5 2023-06-26 12:18:42,611 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1567362.0, ans=0.125 2023-06-26 12:19:43,782 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1567482.0, ans=0.125 2023-06-26 12:19:48,792 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.788e+02 5.514e+02 7.810e+02 1.291e+03 2.321e+03, threshold=1.562e+03, percent-clipped=17.0 2023-06-26 12:19:52,288 INFO [train.py:996] (0/4) Epoch 9, batch 17300, loss[loss=0.2453, simple_loss=0.3184, pruned_loss=0.08614, over 21929.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3089, pruned_loss=0.0779, over 4277625.12 frames. 
], batch size: 372, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:20:00,764 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.91 vs. limit=15.0 2023-06-26 12:20:11,459 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1567602.0, ans=0.2 2023-06-26 12:20:18,969 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1567602.0, ans=0.1 2023-06-26 12:20:40,443 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1567662.0, ans=0.125 2023-06-26 12:20:49,601 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.25 vs. limit=15.0 2023-06-26 12:21:25,160 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1567782.0, ans=0.2 2023-06-26 12:21:38,680 INFO [train.py:996] (0/4) Epoch 9, batch 17350, loss[loss=0.1774, simple_loss=0.2656, pruned_loss=0.04456, over 21432.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3087, pruned_loss=0.07747, over 4275268.28 frames. ], batch size: 211, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:21:50,368 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1567842.0, ans=0.0 2023-06-26 12:22:26,833 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.27 vs. limit=15.0 2023-06-26 12:22:57,042 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1568022.0, ans=0.0 2023-06-26 12:23:15,918 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.405e+02 5.455e+02 8.630e+02 1.274e+03 2.528e+03, threshold=1.726e+03, percent-clipped=16.0 2023-06-26 12:23:19,232 INFO [train.py:996] (0/4) Epoch 9, batch 17400, loss[loss=0.213, simple_loss=0.3062, pruned_loss=0.05988, over 21739.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3051, pruned_loss=0.07424, over 4267178.89 frames. ], batch size: 332, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:23:42,491 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1568202.0, ans=0.0 2023-06-26 12:23:55,172 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.24 vs. limit=15.0 2023-06-26 12:23:56,605 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1568202.0, ans=0.125 2023-06-26 12:23:59,950 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1568262.0, ans=0.1 2023-06-26 12:24:06,742 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1568262.0, ans=0.0 2023-06-26 12:24:07,342 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.64 vs. 
limit=22.5 2023-06-26 12:24:55,813 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1568382.0, ans=0.125 2023-06-26 12:25:10,962 INFO [train.py:996] (0/4) Epoch 9, batch 17450, loss[loss=0.1887, simple_loss=0.2822, pruned_loss=0.04754, over 21628.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.302, pruned_loss=0.07173, over 4273923.34 frames. ], batch size: 247, lr: 3.27e-03, grad_scale: 8.0 2023-06-26 12:25:11,570 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1568442.0, ans=0.1 2023-06-26 12:26:57,101 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.248e+02 4.686e+02 6.725e+02 1.029e+03 2.928e+03, threshold=1.345e+03, percent-clipped=7.0 2023-06-26 12:26:58,681 INFO [train.py:996] (0/4) Epoch 9, batch 17500, loss[loss=0.2025, simple_loss=0.2758, pruned_loss=0.0646, over 21671.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2985, pruned_loss=0.06923, over 4273601.11 frames. ], batch size: 230, lr: 3.27e-03, grad_scale: 8.0 2023-06-26 12:27:06,127 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1568742.0, ans=0.09899494936611666 2023-06-26 12:27:13,428 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1568742.0, ans=0.125 2023-06-26 12:27:47,896 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1568862.0, ans=0.0 2023-06-26 12:28:19,036 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1568982.0, ans=0.0 2023-06-26 12:28:22,569 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1568982.0, ans=0.0 2023-06-26 12:28:41,014 INFO [train.py:996] (0/4) Epoch 9, batch 17550, loss[loss=0.2185, simple_loss=0.3068, pruned_loss=0.06511, over 21395.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2985, pruned_loss=0.0677, over 4269286.11 frames. ], batch size: 131, lr: 3.27e-03, grad_scale: 8.0 2023-06-26 12:29:27,468 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1569162.0, ans=0.125 2023-06-26 12:30:10,000 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1569282.0, ans=0.2 2023-06-26 12:30:34,070 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.446e+02 4.559e+02 6.477e+02 8.639e+02 1.603e+03, threshold=1.295e+03, percent-clipped=2.0 2023-06-26 12:30:35,798 INFO [train.py:996] (0/4) Epoch 9, batch 17600, loss[loss=0.2413, simple_loss=0.3117, pruned_loss=0.08541, over 21724.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.3009, pruned_loss=0.06782, over 4266547.84 frames. ], batch size: 298, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:30:42,632 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.25 vs. 
limit=15.0 2023-06-26 12:31:23,122 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1569462.0, ans=0.125 2023-06-26 12:31:24,762 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1569462.0, ans=0.125 2023-06-26 12:31:44,511 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1569522.0, ans=0.025 2023-06-26 12:32:16,596 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1569582.0, ans=0.125 2023-06-26 12:32:21,710 INFO [train.py:996] (0/4) Epoch 9, batch 17650, loss[loss=0.239, simple_loss=0.3274, pruned_loss=0.07531, over 21473.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2996, pruned_loss=0.0684, over 4245652.95 frames. ], batch size: 131, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:32:30,659 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1569642.0, ans=0.2 2023-06-26 12:32:53,196 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1569702.0, ans=0.2 2023-06-26 12:33:11,018 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1569762.0, ans=0.125 2023-06-26 12:33:32,332 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1569822.0, ans=0.04949747468305833 2023-06-26 12:34:09,281 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.770e+02 6.180e+02 8.586e+02 1.472e+03 2.723e+03, threshold=1.717e+03, percent-clipped=31.0 2023-06-26 12:34:10,909 INFO [train.py:996] (0/4) Epoch 9, batch 17700, loss[loss=0.2888, simple_loss=0.3577, pruned_loss=0.1099, over 21448.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2944, pruned_loss=0.06718, over 4252536.49 frames. ], batch size: 471, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:34:33,423 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.20 vs. limit=6.0 2023-06-26 12:34:42,622 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1570002.0, ans=0.125 2023-06-26 12:34:42,710 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1570002.0, ans=0.125 2023-06-26 12:35:02,710 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=1570062.0, ans=0.02 2023-06-26 12:35:57,622 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1570182.0, ans=0.1 2023-06-26 12:36:06,995 INFO [train.py:996] (0/4) Epoch 9, batch 17750, loss[loss=0.2957, simple_loss=0.3579, pruned_loss=0.1168, over 21470.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.3006, pruned_loss=0.07021, over 4255484.81 frames. 
], batch size: 471, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:36:11,277 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1570242.0, ans=0.0 2023-06-26 12:36:26,285 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.07 vs. limit=22.5 2023-06-26 12:37:43,245 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1570482.0, ans=0.0 2023-06-26 12:37:56,887 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.878e+02 5.343e+02 8.043e+02 1.136e+03 2.008e+03, threshold=1.609e+03, percent-clipped=5.0 2023-06-26 12:38:04,115 INFO [train.py:996] (0/4) Epoch 9, batch 17800, loss[loss=0.3, simple_loss=0.3724, pruned_loss=0.1138, over 21414.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2998, pruned_loss=0.06988, over 4257813.11 frames. ], batch size: 507, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:38:23,840 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.07 vs. limit=15.0 2023-06-26 12:38:39,948 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.40 vs. limit=22.5 2023-06-26 12:39:15,211 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1570722.0, ans=0.125 2023-06-26 12:39:55,327 INFO [train.py:996] (0/4) Epoch 9, batch 17850, loss[loss=0.2843, simple_loss=0.3484, pruned_loss=0.11, over 21740.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.3006, pruned_loss=0.0705, over 4266513.46 frames. ], batch size: 441, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:39:56,263 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.24 vs. limit=15.0 2023-06-26 12:40:06,151 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1570842.0, ans=0.0 2023-06-26 12:40:23,397 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1570902.0, ans=0.125 2023-06-26 12:40:37,705 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff2.min_abs, batch_count=1570902.0, ans=0.1 2023-06-26 12:40:48,932 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.07 vs. limit=15.0 2023-06-26 12:41:39,379 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1571082.0, ans=0.0 2023-06-26 12:41:42,264 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.429e+02 5.491e+02 8.059e+02 1.156e+03 1.916e+03, threshold=1.612e+03, percent-clipped=10.0 2023-06-26 12:41:43,908 INFO [train.py:996] (0/4) Epoch 9, batch 17900, loss[loss=0.2222, simple_loss=0.3204, pruned_loss=0.062, over 21778.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.3051, pruned_loss=0.07198, over 4269795.86 frames. 
], batch size: 282, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:43:13,380 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1571322.0, ans=0.2 2023-06-26 12:43:40,958 INFO [train.py:996] (0/4) Epoch 9, batch 17950, loss[loss=0.2027, simple_loss=0.2959, pruned_loss=0.05475, over 21628.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.3034, pruned_loss=0.06864, over 4268896.79 frames. ], batch size: 389, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:44:10,758 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.65 vs. limit=15.0 2023-06-26 12:44:21,971 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1571562.0, ans=0.125 2023-06-26 12:44:37,465 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1571562.0, ans=0.125 2023-06-26 12:45:06,980 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1571682.0, ans=0.125 2023-06-26 12:45:24,828 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.199e+02 4.426e+02 5.727e+02 7.254e+02 1.857e+03, threshold=1.145e+03, percent-clipped=1.0 2023-06-26 12:45:26,479 INFO [train.py:996] (0/4) Epoch 9, batch 18000, loss[loss=0.1915, simple_loss=0.2587, pruned_loss=0.06213, over 21664.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2963, pruned_loss=0.06708, over 4275860.54 frames. ], batch size: 333, lr: 3.27e-03, grad_scale: 32.0 2023-06-26 12:45:26,480 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-26 12:45:46,678 INFO [train.py:1028] (0/4) Epoch 9, validation: loss=0.2587, simple_loss=0.3543, pruned_loss=0.08153, over 1796401.00 frames. 2023-06-26 12:45:46,679 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23714MB 2023-06-26 12:46:18,083 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.66 vs. limit=12.0 2023-06-26 12:46:23,535 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.13 vs. limit=12.0 2023-06-26 12:46:46,841 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=5.67 vs. limit=22.5 2023-06-26 12:47:35,679 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1572042.0, ans=0.125 2023-06-26 12:47:36,579 INFO [train.py:996] (0/4) Epoch 9, batch 18050, loss[loss=0.1768, simple_loss=0.2545, pruned_loss=0.04955, over 21512.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2908, pruned_loss=0.06627, over 4263702.55 frames. ], batch size: 230, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:47:52,476 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.59 vs. 
limit=10.0 2023-06-26 12:48:29,522 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1572162.0, ans=0.0 2023-06-26 12:48:29,593 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 12:48:51,311 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1572222.0, ans=0.2 2023-06-26 12:48:56,460 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1572222.0, ans=0.125 2023-06-26 12:49:18,855 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.19 vs. limit=15.0 2023-06-26 12:49:28,427 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.634e+02 5.417e+02 6.596e+02 1.071e+03 2.802e+03, threshold=1.319e+03, percent-clipped=21.0 2023-06-26 12:49:28,457 INFO [train.py:996] (0/4) Epoch 9, batch 18100, loss[loss=0.2412, simple_loss=0.3284, pruned_loss=0.07704, over 21829.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2951, pruned_loss=0.06849, over 4270799.47 frames. ], batch size: 118, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:50:51,047 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=1572522.0, ans=0.5 2023-06-26 12:51:03,473 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1572582.0, ans=0.125 2023-06-26 12:51:18,365 INFO [train.py:996] (0/4) Epoch 9, batch 18150, loss[loss=0.1816, simple_loss=0.2553, pruned_loss=0.05398, over 21513.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.297, pruned_loss=0.0683, over 4272143.59 frames. ], batch size: 132, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:51:49,778 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1572702.0, ans=0.1 2023-06-26 12:51:49,797 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1572702.0, ans=0.0 2023-06-26 12:51:54,917 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1572702.0, ans=0.07 2023-06-26 12:52:27,436 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1572822.0, ans=0.0 2023-06-26 12:53:04,892 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1572942.0, ans=0.125 2023-06-26 12:53:05,713 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.440e+02 4.565e+02 5.741e+02 8.756e+02 1.817e+03, threshold=1.148e+03, percent-clipped=4.0 2023-06-26 12:53:05,748 INFO [train.py:996] (0/4) Epoch 9, batch 18200, loss[loss=0.17, simple_loss=0.2528, pruned_loss=0.04363, over 21818.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2915, pruned_loss=0.06825, over 4272174.63 frames. 
], batch size: 118, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:53:43,806 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1573062.0, ans=0.125 2023-06-26 12:53:53,694 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1573062.0, ans=0.125 2023-06-26 12:53:57,414 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.60 vs. limit=10.0 2023-06-26 12:54:25,654 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1573122.0, ans=0.125 2023-06-26 12:54:36,228 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1573182.0, ans=0.0 2023-06-26 12:54:46,932 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.28 vs. limit=15.0 2023-06-26 12:54:47,266 INFO [train.py:996] (0/4) Epoch 9, batch 18250, loss[loss=0.166, simple_loss=0.2424, pruned_loss=0.04481, over 21352.00 frames. ], tot_loss[loss=0.2077, simple_loss=0.284, pruned_loss=0.06568, over 4249600.10 frames. ], batch size: 144, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:55:11,694 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1573302.0, ans=0.125 2023-06-26 12:55:41,577 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1573362.0, ans=0.125 2023-06-26 12:56:26,822 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1573482.0, ans=0.1 2023-06-26 12:56:30,901 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.69 vs. limit=22.5 2023-06-26 12:56:42,103 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.384e+02 4.816e+02 6.355e+02 8.859e+02 2.523e+03, threshold=1.271e+03, percent-clipped=14.0 2023-06-26 12:56:42,139 INFO [train.py:996] (0/4) Epoch 9, batch 18300, loss[loss=0.1945, simple_loss=0.2678, pruned_loss=0.06063, over 21348.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.2834, pruned_loss=0.06511, over 4251278.44 frames. ], batch size: 159, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:56:42,730 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1573542.0, ans=0.125 2023-06-26 12:57:25,231 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=1573662.0, ans=15.0 2023-06-26 12:57:34,573 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1573662.0, ans=0.125 2023-06-26 12:58:25,483 INFO [train.py:996] (0/4) Epoch 9, batch 18350, loss[loss=0.2009, simple_loss=0.2803, pruned_loss=0.06077, over 21649.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.291, pruned_loss=0.06576, over 4245168.45 frames. 
], batch size: 332, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:59:11,844 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1573962.0, ans=0.1 2023-06-26 12:59:12,538 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.33 vs. limit=15.0 2023-06-26 12:59:46,347 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.60 vs. limit=15.0 2023-06-26 12:59:57,764 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1574082.0, ans=0.125 2023-06-26 13:00:14,628 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.067e+02 5.416e+02 7.058e+02 9.535e+02 2.465e+03, threshold=1.412e+03, percent-clipped=12.0 2023-06-26 13:00:14,668 INFO [train.py:996] (0/4) Epoch 9, batch 18400, loss[loss=0.1983, simple_loss=0.2736, pruned_loss=0.06153, over 21832.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2876, pruned_loss=0.06546, over 4249974.02 frames. ], batch size: 107, lr: 3.27e-03, grad_scale: 32.0 2023-06-26 13:02:04,303 INFO [train.py:996] (0/4) Epoch 9, batch 18450, loss[loss=0.1611, simple_loss=0.2516, pruned_loss=0.03533, over 21524.00 frames. ], tot_loss[loss=0.2044, simple_loss=0.285, pruned_loss=0.06188, over 4253174.56 frames. ], batch size: 230, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 13:02:10,281 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.05 vs. limit=6.0 2023-06-26 13:02:13,243 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1574442.0, ans=0.0 2023-06-26 13:03:01,123 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.15 vs. limit=22.5 2023-06-26 13:03:22,799 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1574622.0, ans=0.0 2023-06-26 13:03:42,458 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1574682.0, ans=0.125 2023-06-26 13:03:52,184 INFO [train.py:996] (0/4) Epoch 9, batch 18500, loss[loss=0.1805, simple_loss=0.2509, pruned_loss=0.05506, over 21334.00 frames. ], tot_loss[loss=0.2007, simple_loss=0.2797, pruned_loss=0.06083, over 4250391.89 frames. 
], batch size: 131, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:03:53,906 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.096e+02 4.756e+02 7.398e+02 1.037e+03 4.377e+03, threshold=1.480e+03, percent-clipped=11.0 2023-06-26 13:03:54,373 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1574742.0, ans=0.125 2023-06-26 13:04:10,594 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1574802.0, ans=0.2 2023-06-26 13:04:37,933 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1574862.0, ans=0.0 2023-06-26 13:05:21,252 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1574982.0, ans=0.0 2023-06-26 13:05:24,622 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1574982.0, ans=0.2 2023-06-26 13:05:40,079 INFO [train.py:996] (0/4) Epoch 9, batch 18550, loss[loss=0.1921, simple_loss=0.2617, pruned_loss=0.0613, over 21761.00 frames. ], tot_loss[loss=0.1991, simple_loss=0.2776, pruned_loss=0.06028, over 4259176.60 frames. ], batch size: 124, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:05:42,460 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1575042.0, ans=0.0 2023-06-26 13:06:11,492 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1575102.0, ans=0.0 2023-06-26 13:06:15,669 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.58 vs. limit=6.0 2023-06-26 13:06:15,759 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.65 vs. limit=15.0 2023-06-26 13:06:20,668 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1575162.0, ans=0.1 2023-06-26 13:06:22,505 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1575162.0, ans=0.07 2023-06-26 13:07:28,405 INFO [train.py:996] (0/4) Epoch 9, batch 18600, loss[loss=0.1841, simple_loss=0.2558, pruned_loss=0.05615, over 21109.00 frames. ], tot_loss[loss=0.198, simple_loss=0.2748, pruned_loss=0.06056, over 4265411.34 frames. ], batch size: 143, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:07:29,274 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1575342.0, ans=0.0 2023-06-26 13:07:30,257 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.128e+02 4.633e+02 7.387e+02 1.048e+03 1.831e+03, threshold=1.477e+03, percent-clipped=1.0 2023-06-26 13:07:39,926 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.57 vs. 
limit=15.0 2023-06-26 13:08:24,179 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1575462.0, ans=0.04949747468305833 2023-06-26 13:08:30,957 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1575522.0, ans=0.125 2023-06-26 13:09:05,238 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1575582.0, ans=0.125 2023-06-26 13:09:08,708 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1575582.0, ans=0.09899494936611666 2023-06-26 13:09:15,093 INFO [train.py:996] (0/4) Epoch 9, batch 18650, loss[loss=0.1913, simple_loss=0.2641, pruned_loss=0.05928, over 21773.00 frames. ], tot_loss[loss=0.1987, simple_loss=0.2751, pruned_loss=0.06118, over 4273379.33 frames. ], batch size: 352, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:09:17,288 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1575642.0, ans=0.125 2023-06-26 13:09:27,502 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1575642.0, ans=0.09899494936611666 2023-06-26 13:09:38,916 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1575702.0, ans=0.125 2023-06-26 13:10:35,299 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1575822.0, ans=0.0 2023-06-26 13:10:51,583 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.11 vs. limit=22.5 2023-06-26 13:11:02,378 INFO [train.py:996] (0/4) Epoch 9, batch 18700, loss[loss=0.2007, simple_loss=0.2697, pruned_loss=0.06588, over 21865.00 frames. ], tot_loss[loss=0.1996, simple_loss=0.2734, pruned_loss=0.06288, over 4270500.37 frames. ], batch size: 107, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:11:04,042 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.187e+02 4.395e+02 5.926e+02 8.949e+02 1.374e+03, threshold=1.185e+03, percent-clipped=0.0 2023-06-26 13:11:09,834 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1575942.0, ans=0.0 2023-06-26 13:11:46,703 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1576062.0, ans=0.2 2023-06-26 13:12:09,024 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1576122.0, ans=0.125 2023-06-26 13:12:49,692 INFO [train.py:996] (0/4) Epoch 9, batch 18750, loss[loss=0.2014, simple_loss=0.2597, pruned_loss=0.07154, over 21239.00 frames. ], tot_loss[loss=0.2015, simple_loss=0.2741, pruned_loss=0.06445, over 4261865.39 frames. 
], batch size: 608, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:12:53,692 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1576242.0, ans=0.125 2023-06-26 13:13:38,589 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1576362.0, ans=0.125 2023-06-26 13:13:40,270 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1576362.0, ans=0.95 2023-06-26 13:14:38,342 INFO [train.py:996] (0/4) Epoch 9, batch 18800, loss[loss=0.2308, simple_loss=0.3165, pruned_loss=0.0725, over 21508.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.2807, pruned_loss=0.06609, over 4253438.95 frames. ], batch size: 471, lr: 3.26e-03, grad_scale: 32.0 2023-06-26 13:14:40,098 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.559e+02 6.038e+02 7.723e+02 1.097e+03 3.023e+03, threshold=1.545e+03, percent-clipped=19.0 2023-06-26 13:15:24,841 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1576662.0, ans=0.1 2023-06-26 13:15:30,860 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.20 vs. limit=15.0 2023-06-26 13:15:37,806 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1576662.0, ans=0.0 2023-06-26 13:16:08,921 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1576782.0, ans=0.125 2023-06-26 13:16:15,172 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.47 vs. limit=15.0 2023-06-26 13:16:17,834 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1576782.0, ans=0.0 2023-06-26 13:16:27,803 INFO [train.py:996] (0/4) Epoch 9, batch 18850, loss[loss=0.1609, simple_loss=0.2511, pruned_loss=0.03533, over 21639.00 frames. ], tot_loss[loss=0.2019, simple_loss=0.2786, pruned_loss=0.06256, over 4259610.97 frames. ], batch size: 263, lr: 3.26e-03, grad_scale: 32.0 2023-06-26 13:16:33,813 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.75 vs. limit=15.0 2023-06-26 13:16:36,699 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1576842.0, ans=0.1 2023-06-26 13:16:36,767 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1576842.0, ans=0.0 2023-06-26 13:16:43,801 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1576902.0, ans=10.0 2023-06-26 13:16:50,609 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1576902.0, ans=0.2 2023-06-26 13:18:06,148 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1577082.0, ans=0.0 2023-06-26 13:18:14,393 INFO [train.py:996] (0/4) Epoch 9, batch 18900, loss[loss=0.1914, simple_loss=0.2631, pruned_loss=0.05983, over 21822.00 frames. 
], tot_loss[loss=0.1986, simple_loss=0.2744, pruned_loss=0.0614, over 4260818.49 frames. ], batch size: 371, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:18:17,656 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.228e+02 4.531e+02 6.963e+02 9.490e+02 1.932e+03, threshold=1.393e+03, percent-clipped=3.0 2023-06-26 13:18:57,078 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1577262.0, ans=0.2 2023-06-26 13:19:06,979 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1577262.0, ans=0.125 2023-06-26 13:20:03,854 INFO [train.py:996] (0/4) Epoch 9, batch 18950, loss[loss=0.2287, simple_loss=0.3114, pruned_loss=0.07296, over 21800.00 frames. ], tot_loss[loss=0.2006, simple_loss=0.2744, pruned_loss=0.06342, over 4269634.77 frames. ], batch size: 414, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:20:47,500 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1577562.0, ans=0.1 2023-06-26 13:21:26,985 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1577622.0, ans=10.0 2023-06-26 13:21:33,053 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.46 vs. limit=22.5 2023-06-26 13:21:54,028 INFO [train.py:996] (0/4) Epoch 9, batch 19000, loss[loss=0.2161, simple_loss=0.2759, pruned_loss=0.07821, over 20188.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2845, pruned_loss=0.06525, over 4261676.21 frames. ], batch size: 703, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:21:58,089 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.501e+02 4.865e+02 6.670e+02 8.887e+02 1.787e+03, threshold=1.334e+03, percent-clipped=6.0 2023-06-26 13:22:03,553 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1577742.0, ans=0.1 2023-06-26 13:23:20,168 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1577982.0, ans=0.125 2023-06-26 13:23:34,292 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=1577982.0, ans=0.5 2023-06-26 13:23:37,521 INFO [train.py:996] (0/4) Epoch 9, batch 19050, loss[loss=0.2268, simple_loss=0.3035, pruned_loss=0.07503, over 21842.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2906, pruned_loss=0.06807, over 4270223.97 frames. 
], batch size: 124, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:23:44,626 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1578042.0, ans=0.2 2023-06-26 13:23:57,067 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1578042.0, ans=0.0 2023-06-26 13:24:05,435 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1578102.0, ans=0.04949747468305833 2023-06-26 13:24:17,641 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1578162.0, ans=0.0 2023-06-26 13:24:21,193 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1578162.0, ans=0.0 2023-06-26 13:24:38,352 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1578222.0, ans=0.2 2023-06-26 13:24:40,776 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.95 vs. limit=12.0 2023-06-26 13:25:17,640 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1578282.0, ans=0.1 2023-06-26 13:25:20,507 INFO [train.py:996] (0/4) Epoch 9, batch 19100, loss[loss=0.2223, simple_loss=0.2806, pruned_loss=0.08199, over 21576.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2897, pruned_loss=0.06955, over 4265768.35 frames. ], batch size: 414, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:25:24,158 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.781e+02 5.304e+02 7.054e+02 1.099e+03 1.877e+03, threshold=1.411e+03, percent-clipped=10.0 2023-06-26 13:26:01,282 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1578462.0, ans=0.125 2023-06-26 13:26:38,541 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1578522.0, ans=0.07 2023-06-26 13:26:38,568 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1578522.0, ans=0.0 2023-06-26 13:27:11,386 INFO [train.py:996] (0/4) Epoch 9, batch 19150, loss[loss=0.3048, simple_loss=0.39, pruned_loss=0.1098, over 21485.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2908, pruned_loss=0.07008, over 4269518.93 frames. ], batch size: 471, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:28:25,473 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1578822.0, ans=0.125 2023-06-26 13:28:30,621 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1578822.0, ans=0.0 2023-06-26 13:29:06,067 INFO [train.py:996] (0/4) Epoch 9, batch 19200, loss[loss=0.2491, simple_loss=0.3591, pruned_loss=0.06953, over 21823.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.3022, pruned_loss=0.07114, over 4276191.15 frames. 
], batch size: 371, lr: 3.26e-03, grad_scale: 32.0 2023-06-26 13:29:10,047 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.893e+02 6.153e+02 9.835e+02 1.321e+03 2.570e+03, threshold=1.967e+03, percent-clipped=19.0 2023-06-26 13:29:19,620 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1578942.0, ans=0.0 2023-06-26 13:30:05,945 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1579062.0, ans=0.125 2023-06-26 13:30:24,864 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1579122.0, ans=0.0 2023-06-26 13:30:49,830 INFO [train.py:996] (0/4) Epoch 9, batch 19250, loss[loss=0.2225, simple_loss=0.3324, pruned_loss=0.05626, over 19808.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.3026, pruned_loss=0.06666, over 4261399.54 frames. ], batch size: 703, lr: 3.26e-03, grad_scale: 32.0 2023-06-26 13:31:06,092 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1579302.0, ans=0.125 2023-06-26 13:31:42,119 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1579362.0, ans=0.125 2023-06-26 13:32:05,157 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1579422.0, ans=0.0 2023-06-26 13:32:17,926 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1579482.0, ans=0.125 2023-06-26 13:32:19,685 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1579482.0, ans=0.0 2023-06-26 13:32:21,144 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1579482.0, ans=0.125 2023-06-26 13:32:26,973 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=1579482.0, ans=15.0 2023-06-26 13:32:38,025 INFO [train.py:996] (0/4) Epoch 9, batch 19300, loss[loss=0.1948, simple_loss=0.2727, pruned_loss=0.05848, over 21625.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2999, pruned_loss=0.06571, over 4268014.15 frames. 
], batch size: 263, lr: 3.26e-03, grad_scale: 32.0 2023-06-26 13:32:41,548 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.842e+02 4.708e+02 6.632e+02 9.817e+02 2.132e+03, threshold=1.326e+03, percent-clipped=1.0 2023-06-26 13:32:43,923 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1579542.0, ans=0.125 2023-06-26 13:32:51,293 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1579542.0, ans=0.125 2023-06-26 13:33:23,247 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1579662.0, ans=0.0 2023-06-26 13:33:50,287 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1579722.0, ans=0.125 2023-06-26 13:34:22,107 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1579842.0, ans=0.0 2023-06-26 13:34:23,285 INFO [train.py:996] (0/4) Epoch 9, batch 19350, loss[loss=0.1635, simple_loss=0.2394, pruned_loss=0.04379, over 21149.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2943, pruned_loss=0.06258, over 4273017.60 frames. ], batch size: 143, lr: 3.26e-03, grad_scale: 32.0 2023-06-26 13:34:27,205 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1579842.0, ans=0.07 2023-06-26 13:34:32,410 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1579842.0, ans=0.0 2023-06-26 13:35:14,603 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1579962.0, ans=0.1 2023-06-26 13:35:49,312 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1580082.0, ans=0.09899494936611666 2023-06-26 13:36:04,319 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1580082.0, ans=0.0 2023-06-26 13:36:10,344 INFO [train.py:996] (0/4) Epoch 9, batch 19400, loss[loss=0.2093, simple_loss=0.2854, pruned_loss=0.06658, over 21859.00 frames. ], tot_loss[loss=0.2074, simple_loss=0.2908, pruned_loss=0.06201, over 4280616.20 frames. ], batch size: 351, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:36:11,576 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.67 vs. 
limit=22.5 2023-06-26 13:36:15,960 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.056e+02 5.043e+02 7.685e+02 1.074e+03 1.940e+03, threshold=1.537e+03, percent-clipped=16.0 2023-06-26 13:37:03,729 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1580262.0, ans=0.1 2023-06-26 13:37:12,159 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1580322.0, ans=0.125 2023-06-26 13:37:24,765 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1580322.0, ans=0.0 2023-06-26 13:37:27,974 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1580322.0, ans=0.07 2023-06-26 13:37:31,321 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1580382.0, ans=0.125 2023-06-26 13:37:53,596 INFO [train.py:996] (0/4) Epoch 9, batch 19450, loss[loss=0.2159, simple_loss=0.2763, pruned_loss=0.07773, over 21897.00 frames. ], tot_loss[loss=0.208, simple_loss=0.2882, pruned_loss=0.06393, over 4285662.14 frames. ], batch size: 373, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:38:45,823 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1580562.0, ans=0.125 2023-06-26 13:39:10,963 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.01 vs. limit=12.0 2023-06-26 13:39:41,565 INFO [train.py:996] (0/4) Epoch 9, batch 19500, loss[loss=0.1857, simple_loss=0.2637, pruned_loss=0.0539, over 21771.00 frames. ], tot_loss[loss=0.2064, simple_loss=0.2843, pruned_loss=0.06424, over 4285574.88 frames. ], batch size: 282, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:39:46,902 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.457e+02 4.487e+02 6.079e+02 9.287e+02 2.149e+03, threshold=1.216e+03, percent-clipped=7.0 2023-06-26 13:39:58,139 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1580742.0, ans=0.125 2023-06-26 13:40:01,419 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1580742.0, ans=0.1 2023-06-26 13:41:31,315 INFO [train.py:996] (0/4) Epoch 9, batch 19550, loss[loss=0.1907, simple_loss=0.2752, pruned_loss=0.05315, over 20806.00 frames. ], tot_loss[loss=0.2045, simple_loss=0.2815, pruned_loss=0.06372, over 4280666.37 frames. ], batch size: 607, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:42:28,644 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1581162.0, ans=0.125 2023-06-26 13:42:38,086 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.22 vs. 
limit=12.0 2023-06-26 13:42:48,151 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1581222.0, ans=0.125 2023-06-26 13:43:10,739 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1581282.0, ans=0.125 2023-06-26 13:43:18,849 INFO [train.py:996] (0/4) Epoch 9, batch 19600, loss[loss=0.246, simple_loss=0.2977, pruned_loss=0.09717, over 21779.00 frames. ], tot_loss[loss=0.2061, simple_loss=0.2831, pruned_loss=0.06455, over 4289431.35 frames. ], batch size: 508, lr: 3.26e-03, grad_scale: 32.0 2023-06-26 13:43:29,301 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.268e+02 5.045e+02 6.281e+02 9.154e+02 2.396e+03, threshold=1.256e+03, percent-clipped=14.0 2023-06-26 13:43:32,158 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.72 vs. limit=15.0 2023-06-26 13:43:56,600 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1581402.0, ans=0.125 2023-06-26 13:44:12,291 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1581462.0, ans=0.125 2023-06-26 13:44:18,183 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=17.35 vs. limit=22.5 2023-06-26 13:44:21,099 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1581462.0, ans=0.125 2023-06-26 13:44:37,455 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.26 vs. limit=22.5 2023-06-26 13:44:47,144 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1581522.0, ans=0.0 2023-06-26 13:44:47,869 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.87 vs. limit=15.0 2023-06-26 13:44:59,258 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1581582.0, ans=0.2 2023-06-26 13:45:13,612 INFO [train.py:996] (0/4) Epoch 9, batch 19650, loss[loss=0.2178, simple_loss=0.2893, pruned_loss=0.07313, over 21864.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2885, pruned_loss=0.06832, over 4287010.38 frames. ], batch size: 371, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:46:02,402 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1581762.0, ans=0.125 2023-06-26 13:46:11,177 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1581762.0, ans=0.125 2023-06-26 13:46:20,152 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1581762.0, ans=0.125 2023-06-26 13:46:24,138 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.68 vs. 
limit=15.0 2023-06-26 13:46:34,899 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1581822.0, ans=0.0 2023-06-26 13:46:36,772 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1581822.0, ans=0.125 2023-06-26 13:47:14,779 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1581942.0, ans=0.125 2023-06-26 13:47:15,524 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.02 vs. limit=15.0 2023-06-26 13:47:15,818 INFO [train.py:996] (0/4) Epoch 9, batch 19700, loss[loss=0.232, simple_loss=0.3273, pruned_loss=0.06834, over 21603.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2898, pruned_loss=0.06823, over 4286974.11 frames. ], batch size: 441, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:47:16,550 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1581942.0, ans=0.125 2023-06-26 13:47:22,846 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.461e+02 6.188e+02 8.447e+02 1.401e+03 2.428e+03, threshold=1.689e+03, percent-clipped=28.0 2023-06-26 13:47:29,916 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.23 vs. limit=15.0 2023-06-26 13:47:30,893 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1581942.0, ans=0.1 2023-06-26 13:48:01,110 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.46 vs. limit=15.0 2023-06-26 13:48:07,492 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=1582062.0, ans=22.5 2023-06-26 13:48:09,601 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.07 vs. limit=22.5 2023-06-26 13:48:48,288 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1582182.0, ans=0.1 2023-06-26 13:49:06,316 INFO [train.py:996] (0/4) Epoch 9, batch 19750, loss[loss=0.2261, simple_loss=0.3091, pruned_loss=0.07157, over 21463.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2992, pruned_loss=0.06977, over 4281070.50 frames. ], batch size: 194, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:49:21,570 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.74 vs. limit=22.5 2023-06-26 13:49:31,236 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1582302.0, ans=0.125 2023-06-26 13:50:21,803 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1582422.0, ans=0.0 2023-06-26 13:50:55,598 INFO [train.py:996] (0/4) Epoch 9, batch 19800, loss[loss=0.1954, simple_loss=0.2834, pruned_loss=0.05376, over 21422.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2994, pruned_loss=0.07035, over 4285005.81 frames. 
], batch size: 548, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:51:02,877 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.312e+02 6.209e+02 8.156e+02 1.271e+03 2.290e+03, threshold=1.631e+03, percent-clipped=8.0 2023-06-26 13:51:24,308 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1582602.0, ans=0.125 2023-06-26 13:51:30,023 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.82 vs. limit=12.0 2023-06-26 13:52:00,195 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1582722.0, ans=0.125 2023-06-26 13:52:25,781 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 13:52:40,912 INFO [train.py:996] (0/4) Epoch 9, batch 19850, loss[loss=0.2353, simple_loss=0.3237, pruned_loss=0.07348, over 21624.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2929, pruned_loss=0.06634, over 4282135.66 frames. ], batch size: 441, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:52:42,081 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.11 vs. limit=22.5 2023-06-26 13:53:42,336 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.52 vs. limit=15.0 2023-06-26 13:54:27,769 INFO [train.py:996] (0/4) Epoch 9, batch 19900, loss[loss=0.1875, simple_loss=0.2662, pruned_loss=0.05436, over 21763.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2938, pruned_loss=0.06364, over 4277297.71 frames. ], batch size: 124, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:54:32,090 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=1583142.0, ans=0.5 2023-06-26 13:54:34,750 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.163e+02 4.779e+02 6.020e+02 7.987e+02 2.016e+03, threshold=1.204e+03, percent-clipped=5.0 2023-06-26 13:54:46,230 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1583142.0, ans=0.125 2023-06-26 13:54:59,086 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.67 vs. limit=12.0 2023-06-26 13:55:18,246 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1583262.0, ans=0.0 2023-06-26 13:55:56,665 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1583322.0, ans=0.0 2023-06-26 13:56:08,457 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1583382.0, ans=0.2 2023-06-26 13:56:10,802 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1583382.0, ans=0.125 2023-06-26 13:56:18,964 INFO [train.py:996] (0/4) Epoch 9, batch 19950, loss[loss=0.1963, simple_loss=0.2556, pruned_loss=0.06848, over 21429.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2883, pruned_loss=0.06349, over 4273806.51 frames. 
], batch size: 195, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:57:24,896 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1583562.0, ans=0.2 2023-06-26 13:57:38,818 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.49 vs. limit=15.0 2023-06-26 13:57:56,236 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.49 vs. limit=12.0 2023-06-26 13:58:06,824 INFO [train.py:996] (0/4) Epoch 9, batch 20000, loss[loss=0.2278, simple_loss=0.3023, pruned_loss=0.0767, over 21857.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2895, pruned_loss=0.06457, over 4276316.28 frames. ], batch size: 351, lr: 3.26e-03, grad_scale: 32.0 2023-06-26 13:58:19,159 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.477e+02 4.524e+02 6.104e+02 8.785e+02 2.084e+03, threshold=1.221e+03, percent-clipped=7.0 2023-06-26 13:58:26,694 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1583742.0, ans=0.125 2023-06-26 13:59:07,111 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1583862.0, ans=0.125 2023-06-26 13:59:41,099 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-264000.pt 2023-06-26 13:59:56,239 INFO [train.py:996] (0/4) Epoch 9, batch 20050, loss[loss=0.2347, simple_loss=0.3018, pruned_loss=0.08379, over 21616.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2911, pruned_loss=0.06655, over 4278218.87 frames. ], batch size: 471, lr: 3.26e-03, grad_scale: 32.0 2023-06-26 14:01:00,137 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1584162.0, ans=0.125 2023-06-26 14:01:51,599 INFO [train.py:996] (0/4) Epoch 9, batch 20100, loss[loss=0.2228, simple_loss=0.3261, pruned_loss=0.05973, over 21804.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.294, pruned_loss=0.06922, over 4281646.40 frames. ], batch size: 282, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 14:01:53,973 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1584342.0, ans=0.1 2023-06-26 14:02:00,511 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.881e+02 4.985e+02 7.812e+02 1.091e+03 2.146e+03, threshold=1.562e+03, percent-clipped=15.0 2023-06-26 14:03:41,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1584642.0, ans=0.0 2023-06-26 14:03:43,029 INFO [train.py:996] (0/4) Epoch 9, batch 20150, loss[loss=0.2553, simple_loss=0.3329, pruned_loss=0.08887, over 21410.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.3011, pruned_loss=0.07195, over 4283032.29 frames. ], batch size: 159, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:04:29,334 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1584702.0, ans=0.125 2023-06-26 14:05:53,468 INFO [train.py:996] (0/4) Epoch 9, batch 20200, loss[loss=0.2558, simple_loss=0.3599, pruned_loss=0.07585, over 20741.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3077, pruned_loss=0.0749, over 4273171.25 frames. 
], batch size: 607, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:06:02,521 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.033e+02 6.140e+02 1.031e+03 1.445e+03 3.124e+03, threshold=2.061e+03, percent-clipped=23.0 2023-06-26 14:07:05,192 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1585122.0, ans=0.1 2023-06-26 14:07:43,933 INFO [train.py:996] (0/4) Epoch 9, batch 20250, loss[loss=0.2108, simple_loss=0.2994, pruned_loss=0.06114, over 21795.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3089, pruned_loss=0.07329, over 4271923.70 frames. ], batch size: 332, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:07:52,984 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1585242.0, ans=0.1 2023-06-26 14:08:02,073 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1585302.0, ans=0.125 2023-06-26 14:09:13,702 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 14:09:18,945 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1585482.0, ans=0.2 2023-06-26 14:09:26,864 INFO [train.py:996] (0/4) Epoch 9, batch 20300, loss[loss=0.1853, simple_loss=0.2748, pruned_loss=0.04789, over 21422.00 frames. ], tot_loss[loss=0.223, simple_loss=0.3056, pruned_loss=0.0702, over 4274894.02 frames. ], batch size: 211, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:09:35,561 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.339e+02 4.853e+02 6.521e+02 1.002e+03 2.689e+03, threshold=1.304e+03, percent-clipped=1.0 2023-06-26 14:09:44,916 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1585602.0, ans=0.07 2023-06-26 14:10:00,748 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.46 vs. limit=15.0 2023-06-26 14:10:25,510 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=1585722.0, ans=15.0 2023-06-26 14:10:49,544 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1585782.0, ans=0.07 2023-06-26 14:11:15,957 INFO [train.py:996] (0/4) Epoch 9, batch 20350, loss[loss=0.2439, simple_loss=0.316, pruned_loss=0.08591, over 21801.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.3053, pruned_loss=0.0701, over 4259272.37 frames. ], batch size: 332, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:11:22,376 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1585842.0, ans=0.0 2023-06-26 14:11:29,253 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1585842.0, ans=0.125 2023-06-26 14:13:04,280 INFO [train.py:996] (0/4) Epoch 9, batch 20400, loss[loss=0.2584, simple_loss=0.3352, pruned_loss=0.09082, over 21366.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.307, pruned_loss=0.07222, over 4260921.98 frames. 
], batch size: 548, lr: 3.25e-03, grad_scale: 32.0 2023-06-26 14:13:13,312 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.291e+02 5.756e+02 8.261e+02 1.227e+03 2.104e+03, threshold=1.652e+03, percent-clipped=22.0 2023-06-26 14:13:21,197 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1586202.0, ans=0.125 2023-06-26 14:13:22,743 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1586202.0, ans=0.125 2023-06-26 14:14:05,710 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1586322.0, ans=0.0 2023-06-26 14:14:05,754 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1586322.0, ans=0.125 2023-06-26 14:14:52,322 INFO [train.py:996] (0/4) Epoch 9, batch 20450, loss[loss=0.2558, simple_loss=0.3187, pruned_loss=0.09648, over 21553.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3081, pruned_loss=0.07516, over 4254599.17 frames. ], batch size: 471, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:15:23,506 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1586502.0, ans=0.0 2023-06-26 14:15:45,629 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1586562.0, ans=0.0 2023-06-26 14:16:13,714 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1586682.0, ans=0.125 2023-06-26 14:16:30,935 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1586682.0, ans=0.125 2023-06-26 14:16:33,727 INFO [train.py:996] (0/4) Epoch 9, batch 20500, loss[loss=0.2343, simple_loss=0.2999, pruned_loss=0.08437, over 21680.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.304, pruned_loss=0.07513, over 4256329.53 frames. ], batch size: 414, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:16:44,020 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.954e+02 5.491e+02 7.367e+02 1.069e+03 2.836e+03, threshold=1.473e+03, percent-clipped=8.0 2023-06-26 14:16:55,284 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 14:17:22,719 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 14:18:03,286 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1586982.0, ans=0.125 2023-06-26 14:18:08,763 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1586982.0, ans=0.0 2023-06-26 14:18:10,149 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1586982.0, ans=0.125 2023-06-26 14:18:21,712 INFO [train.py:996] (0/4) Epoch 9, batch 20550, loss[loss=0.1898, simple_loss=0.27, pruned_loss=0.05486, over 21136.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.2974, pruned_loss=0.07315, over 4244191.64 frames. 
], batch size: 143, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:18:22,218 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1587042.0, ans=0.1 2023-06-26 14:18:25,987 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.42 vs. limit=12.0 2023-06-26 14:19:04,228 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1587162.0, ans=0.125 2023-06-26 14:19:08,258 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.67 vs. limit=22.5 2023-06-26 14:19:11,281 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1587162.0, ans=0.125 2023-06-26 14:19:17,046 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.35 vs. limit=15.0 2023-06-26 14:19:25,652 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.34 vs. limit=22.5 2023-06-26 14:20:09,351 INFO [train.py:996] (0/4) Epoch 9, batch 20600, loss[loss=0.2295, simple_loss=0.3012, pruned_loss=0.07894, over 21833.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2987, pruned_loss=0.0712, over 4236028.25 frames. ], batch size: 282, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:20:17,145 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1587342.0, ans=0.125 2023-06-26 14:20:19,843 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.630e+02 4.988e+02 6.640e+02 9.393e+02 1.385e+03, threshold=1.328e+03, percent-clipped=0.0 2023-06-26 14:20:42,408 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1587402.0, ans=0.125 2023-06-26 14:20:48,037 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1587462.0, ans=0.035 2023-06-26 14:20:51,320 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1587462.0, ans=0.125 2023-06-26 14:21:07,370 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.43 vs. limit=6.0 2023-06-26 14:21:41,960 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1587582.0, ans=0.0 2023-06-26 14:21:56,975 INFO [train.py:996] (0/4) Epoch 9, batch 20650, loss[loss=0.1906, simple_loss=0.2651, pruned_loss=0.05801, over 21683.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2952, pruned_loss=0.07123, over 4237274.16 frames. ], batch size: 332, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:23:47,421 INFO [train.py:996] (0/4) Epoch 9, batch 20700, loss[loss=0.1742, simple_loss=0.2663, pruned_loss=0.04105, over 21716.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.287, pruned_loss=0.06805, over 4236844.44 frames. 
], batch size: 332, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:23:58,486 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.316e+02 4.993e+02 7.910e+02 1.068e+03 1.993e+03, threshold=1.582e+03, percent-clipped=12.0 2023-06-26 14:24:04,374 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1588002.0, ans=0.125 2023-06-26 14:24:11,147 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1588002.0, ans=0.2 2023-06-26 14:24:27,289 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 14:24:55,729 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.23 vs. limit=15.0 2023-06-26 14:25:38,432 INFO [train.py:996] (0/4) Epoch 9, batch 20750, loss[loss=0.286, simple_loss=0.3725, pruned_loss=0.09971, over 21807.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2905, pruned_loss=0.06785, over 4250428.88 frames. ], batch size: 371, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:25:51,462 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1588242.0, ans=0.0 2023-06-26 14:26:10,615 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1588302.0, ans=0.1 2023-06-26 14:26:17,828 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1588362.0, ans=0.015 2023-06-26 14:26:47,237 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1588422.0, ans=0.2 2023-06-26 14:26:50,711 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1588422.0, ans=0.125 2023-06-26 14:27:17,935 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1588482.0, ans=0.04949747468305833 2023-06-26 14:27:32,210 INFO [train.py:996] (0/4) Epoch 9, batch 20800, loss[loss=0.1997, simple_loss=0.2665, pruned_loss=0.06646, over 21720.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2937, pruned_loss=0.06851, over 4254661.05 frames. ], batch size: 316, lr: 3.25e-03, grad_scale: 32.0 2023-06-26 14:27:42,704 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.373e+02 6.315e+02 8.167e+02 1.529e+03 3.332e+03, threshold=1.633e+03, percent-clipped=23.0 2023-06-26 14:27:45,207 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1588542.0, ans=0.125 2023-06-26 14:29:12,104 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.02 vs. limit=10.0 2023-06-26 14:29:19,950 INFO [train.py:996] (0/4) Epoch 9, batch 20850, loss[loss=0.1865, simple_loss=0.2699, pruned_loss=0.05155, over 21827.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2852, pruned_loss=0.06619, over 4261883.34 frames. 
], batch size: 351, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:29:20,424 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1588842.0, ans=0.0 2023-06-26 14:29:47,017 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1588902.0, ans=0.0 2023-06-26 14:30:11,746 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1588962.0, ans=0.04949747468305833 2023-06-26 14:30:22,806 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1589022.0, ans=0.125 2023-06-26 14:30:33,855 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1589022.0, ans=0.07 2023-06-26 14:30:42,173 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1589022.0, ans=0.125 2023-06-26 14:31:08,538 INFO [train.py:996] (0/4) Epoch 9, batch 20900, loss[loss=0.2816, simple_loss=0.3419, pruned_loss=0.1106, over 21646.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2866, pruned_loss=0.06779, over 4270750.92 frames. ], batch size: 508, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:31:17,815 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1589142.0, ans=0.5 2023-06-26 14:31:19,218 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1589142.0, ans=0.0 2023-06-26 14:31:20,491 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.215e+02 4.594e+02 6.029e+02 1.010e+03 2.105e+03, threshold=1.206e+03, percent-clipped=4.0 2023-06-26 14:31:21,617 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.25 vs. limit=22.5 2023-06-26 14:32:29,091 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1589322.0, ans=0.125 2023-06-26 14:32:34,009 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1589382.0, ans=0.125 2023-06-26 14:32:48,500 INFO [train.py:996] (0/4) Epoch 9, batch 20950, loss[loss=0.1888, simple_loss=0.2659, pruned_loss=0.05584, over 21855.00 frames. ], tot_loss[loss=0.2057, simple_loss=0.2825, pruned_loss=0.06444, over 4261395.41 frames. ], batch size: 102, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:33:21,840 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1589502.0, ans=0.0 2023-06-26 14:34:23,176 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1589682.0, ans=10.0 2023-06-26 14:34:36,221 INFO [train.py:996] (0/4) Epoch 9, batch 21000, loss[loss=0.1512, simple_loss=0.2211, pruned_loss=0.04066, over 15656.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.2819, pruned_loss=0.06496, over 4271236.44 frames. 
], batch size: 60, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:34:36,222 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-26 14:34:50,669 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.0.layers.1.self_attn_weights, attn_weights_entropy = tensor([5.2561, 4.7775, 5.0039, 4.3957], device='cuda:0') 2023-06-26 14:34:59,718 INFO [train.py:1028] (0/4) Epoch 9, validation: loss=0.2612, simple_loss=0.3587, pruned_loss=0.0819, over 1796401.00 frames. 2023-06-26 14:34:59,720 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23714MB 2023-06-26 14:35:02,918 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1589742.0, ans=0.125 2023-06-26 14:35:09,128 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1589742.0, ans=0.125 2023-06-26 14:35:11,946 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.378e+02 4.892e+02 7.035e+02 1.069e+03 1.759e+03, threshold=1.407e+03, percent-clipped=17.0 2023-06-26 14:35:12,818 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1589742.0, ans=0.125 2023-06-26 14:36:06,262 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1589922.0, ans=0.07 2023-06-26 14:36:19,018 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1589922.0, ans=0.0 2023-06-26 14:36:49,898 INFO [train.py:996] (0/4) Epoch 9, batch 21050, loss[loss=0.1861, simple_loss=0.2508, pruned_loss=0.06075, over 21525.00 frames. ], tot_loss[loss=0.2053, simple_loss=0.2804, pruned_loss=0.06509, over 4275664.19 frames. ], batch size: 230, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:37:34,890 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1590162.0, ans=0.125 2023-06-26 14:38:36,807 INFO [train.py:996] (0/4) Epoch 9, batch 21100, loss[loss=0.2176, simple_loss=0.2845, pruned_loss=0.07534, over 21527.00 frames. ], tot_loss[loss=0.2034, simple_loss=0.2772, pruned_loss=0.06483, over 4264737.22 frames. ], batch size: 414, lr: 3.25e-03, grad_scale: 8.0 2023-06-26 14:38:50,943 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.682e+02 5.080e+02 7.538e+02 1.007e+03 2.026e+03, threshold=1.508e+03, percent-clipped=9.0 2023-06-26 14:39:05,080 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.10 vs. limit=15.0 2023-06-26 14:39:30,516 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1590462.0, ans=0.125 2023-06-26 14:39:49,453 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1590522.0, ans=0.035 2023-06-26 14:39:51,732 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.37 vs. limit=6.0 2023-06-26 14:39:59,182 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.72 vs. 
limit=22.5 2023-06-26 14:40:25,041 INFO [train.py:996] (0/4) Epoch 9, batch 21150, loss[loss=0.1956, simple_loss=0.2629, pruned_loss=0.06416, over 21669.00 frames. ], tot_loss[loss=0.2021, simple_loss=0.2738, pruned_loss=0.06519, over 4265583.11 frames. ], batch size: 333, lr: 3.25e-03, grad_scale: 8.0 2023-06-26 14:40:45,676 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1590702.0, ans=0.0 2023-06-26 14:41:10,970 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.47 vs. limit=15.0 2023-06-26 14:41:12,086 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1590762.0, ans=0.2 2023-06-26 14:41:15,287 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1590762.0, ans=0.125 2023-06-26 14:41:24,642 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.24 vs. limit=10.0 2023-06-26 14:41:45,995 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1590822.0, ans=0.0 2023-06-26 14:41:47,386 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1590822.0, ans=0.0 2023-06-26 14:41:48,010 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.43 vs. limit=15.0 2023-06-26 14:42:12,173 INFO [train.py:996] (0/4) Epoch 9, batch 21200, loss[loss=0.2435, simple_loss=0.2859, pruned_loss=0.1005, over 21359.00 frames. ], tot_loss[loss=0.2004, simple_loss=0.2704, pruned_loss=0.06522, over 4252100.45 frames. ], batch size: 508, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:42:26,038 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.202e+02 4.962e+02 6.952e+02 8.758e+02 1.783e+03, threshold=1.390e+03, percent-clipped=2.0 2023-06-26 14:42:30,451 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1591002.0, ans=0.0 2023-06-26 14:42:35,504 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1591002.0, ans=0.1 2023-06-26 14:43:07,238 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1591062.0, ans=0.1 2023-06-26 14:43:29,043 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1591122.0, ans=0.125 2023-06-26 14:43:30,906 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1591122.0, ans=0.125 2023-06-26 14:43:43,440 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1591182.0, ans=0.0 2023-06-26 14:43:56,762 INFO [train.py:996] (0/4) Epoch 9, batch 21250, loss[loss=0.2757, simple_loss=0.3249, pruned_loss=0.1132, over 21464.00 frames. ], tot_loss[loss=0.1996, simple_loss=0.2687, pruned_loss=0.06523, over 4253971.53 frames. 
], batch size: 509, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:44:03,062 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.43 vs. limit=22.5 2023-06-26 14:44:52,573 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1591422.0, ans=0.04949747468305833 2023-06-26 14:45:04,952 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1591422.0, ans=0.125 2023-06-26 14:45:33,936 INFO [train.py:996] (0/4) Epoch 9, batch 21300, loss[loss=0.242, simple_loss=0.314, pruned_loss=0.085, over 21932.00 frames. ], tot_loss[loss=0.2053, simple_loss=0.276, pruned_loss=0.06736, over 4253154.79 frames. ], batch size: 415, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:45:52,771 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.243e+02 5.570e+02 8.003e+02 1.129e+03 3.066e+03, threshold=1.601e+03, percent-clipped=15.0 2023-06-26 14:45:56,893 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1591602.0, ans=0.125 2023-06-26 14:46:28,658 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1591662.0, ans=0.0 2023-06-26 14:47:23,335 INFO [train.py:996] (0/4) Epoch 9, batch 21350, loss[loss=0.1953, simple_loss=0.3011, pruned_loss=0.04472, over 19749.00 frames. ], tot_loss[loss=0.208, simple_loss=0.2805, pruned_loss=0.06775, over 4256303.52 frames. ], batch size: 703, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:47:52,126 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1591902.0, ans=10.0 2023-06-26 14:48:13,264 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1591962.0, ans=0.1 2023-06-26 14:48:28,854 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1591962.0, ans=0.05 2023-06-26 14:48:37,835 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1592022.0, ans=0.2 2023-06-26 14:49:12,007 INFO [train.py:996] (0/4) Epoch 9, batch 21400, loss[loss=0.2379, simple_loss=0.3135, pruned_loss=0.08118, over 21319.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2836, pruned_loss=0.06751, over 4254082.00 frames. 
], batch size: 176, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:49:12,956 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1592142.0, ans=0.125 2023-06-26 14:49:26,000 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.278e+02 4.706e+02 6.583e+02 9.880e+02 2.077e+03, threshold=1.317e+03, percent-clipped=4.0 2023-06-26 14:49:35,503 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1592202.0, ans=0.125 2023-06-26 14:50:43,123 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1592382.0, ans=0.0 2023-06-26 14:50:59,560 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1592442.0, ans=0.125 2023-06-26 14:51:00,499 INFO [train.py:996] (0/4) Epoch 9, batch 21450, loss[loss=0.2219, simple_loss=0.2925, pruned_loss=0.07565, over 21482.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.2872, pruned_loss=0.06867, over 4261875.99 frames. ], batch size: 548, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:51:50,694 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1592562.0, ans=0.125 2023-06-26 14:52:27,035 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1592682.0, ans=0.125 2023-06-26 14:52:41,020 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1592682.0, ans=0.04949747468305833 2023-06-26 14:52:43,808 INFO [train.py:996] (0/4) Epoch 9, batch 21500, loss[loss=0.1926, simple_loss=0.2613, pruned_loss=0.06197, over 21729.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.285, pruned_loss=0.06919, over 4274714.75 frames. ], batch size: 333, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:52:44,399 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 14:52:45,087 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.01 vs. limit=15.0 2023-06-26 14:52:55,380 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.76 vs. limit=10.0 2023-06-26 14:53:03,330 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.822e+02 5.893e+02 8.169e+02 1.189e+03 2.218e+03, threshold=1.634e+03, percent-clipped=19.0 2023-06-26 14:53:44,824 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.49 vs. limit=15.0 2023-06-26 14:53:46,110 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1592862.0, ans=0.125 2023-06-26 14:54:05,005 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.64 vs. limit=10.0 2023-06-26 14:54:10,163 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1592922.0, ans=0.09899494936611666 2023-06-26 14:54:32,496 INFO [train.py:996] (0/4) Epoch 9, batch 21550, loss[loss=0.2252, simple_loss=0.2943, pruned_loss=0.0781, over 21470.00 frames. 
], tot_loss[loss=0.2061, simple_loss=0.2782, pruned_loss=0.06696, over 4262753.57 frames. ], batch size: 211, lr: 3.25e-03, grad_scale: 8.0 2023-06-26 14:55:27,826 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1593162.0, ans=0.125 2023-06-26 14:55:31,403 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1593162.0, ans=0.125 2023-06-26 14:55:51,969 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.51 vs. limit=10.0 2023-06-26 14:55:57,119 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.86 vs. limit=15.0 2023-06-26 14:56:26,270 INFO [train.py:996] (0/4) Epoch 9, batch 21600, loss[loss=0.203, simple_loss=0.2966, pruned_loss=0.05469, over 21573.00 frames. ], tot_loss[loss=0.2019, simple_loss=0.2732, pruned_loss=0.06534, over 4264380.57 frames. ], batch size: 389, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:56:34,643 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1593342.0, ans=0.0 2023-06-26 14:56:53,156 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.217e+02 4.926e+02 7.373e+02 9.794e+02 2.336e+03, threshold=1.475e+03, percent-clipped=12.0 2023-06-26 14:57:56,841 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1593582.0, ans=0.0 2023-06-26 14:58:09,403 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.73 vs. limit=15.0 2023-06-26 14:58:15,076 INFO [train.py:996] (0/4) Epoch 9, batch 21650, loss[loss=0.206, simple_loss=0.2873, pruned_loss=0.06233, over 21239.00 frames. ], tot_loss[loss=0.203, simple_loss=0.2781, pruned_loss=0.06396, over 4273538.86 frames. ], batch size: 143, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:58:20,534 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1593642.0, ans=0.125 2023-06-26 14:58:54,055 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.37 vs. limit=12.0 2023-06-26 14:59:15,658 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1593762.0, ans=0.0 2023-06-26 14:59:19,975 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.49 vs. limit=15.0 2023-06-26 14:59:41,048 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1593822.0, ans=0.125 2023-06-26 14:59:48,045 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1593882.0, ans=0.0 2023-06-26 15:00:01,542 INFO [train.py:996] (0/4) Epoch 9, batch 21700, loss[loss=0.1785, simple_loss=0.2457, pruned_loss=0.05564, over 21363.00 frames. ], tot_loss[loss=0.2005, simple_loss=0.2776, pruned_loss=0.06173, over 4272041.75 frames. 
], batch size: 160, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 15:00:16,133 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 15:00:17,740 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1593942.0, ans=0.2 2023-06-26 15:00:22,141 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.421e+02 4.737e+02 7.563e+02 1.159e+03 3.422e+03, threshold=1.513e+03, percent-clipped=14.0 2023-06-26 15:00:23,297 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.11 vs. limit=15.0 2023-06-26 15:00:25,959 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1594002.0, ans=0.1 2023-06-26 15:00:27,780 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1594002.0, ans=0.04949747468305833 2023-06-26 15:00:43,263 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1594002.0, ans=0.0 2023-06-26 15:01:47,607 INFO [train.py:996] (0/4) Epoch 9, batch 21750, loss[loss=0.1737, simple_loss=0.2366, pruned_loss=0.0554, over 21390.00 frames. ], tot_loss[loss=0.1985, simple_loss=0.2744, pruned_loss=0.06132, over 4277401.36 frames. ], batch size: 212, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:02:16,004 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1594302.0, ans=0.125 2023-06-26 15:02:50,591 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1594362.0, ans=0.125 2023-06-26 15:03:08,269 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1594422.0, ans=0.0 2023-06-26 15:03:37,430 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1594542.0, ans=0.1 2023-06-26 15:03:38,365 INFO [train.py:996] (0/4) Epoch 9, batch 21800, loss[loss=0.2176, simple_loss=0.3081, pruned_loss=0.06348, over 21621.00 frames. ], tot_loss[loss=0.1984, simple_loss=0.2721, pruned_loss=0.06236, over 4284750.76 frames. ], batch size: 298, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:03:54,409 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1594542.0, ans=0.125 2023-06-26 15:04:04,211 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.457e+02 4.843e+02 6.619e+02 9.442e+02 2.103e+03, threshold=1.324e+03, percent-clipped=2.0 2023-06-26 15:04:21,225 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.50 vs. limit=15.0 2023-06-26 15:04:32,549 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1594662.0, ans=0.2 2023-06-26 15:04:54,992 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.01 vs. 
limit=15.0 2023-06-26 15:04:56,756 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1594722.0, ans=0.0 2023-06-26 15:05:17,672 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1594782.0, ans=0.125 2023-06-26 15:05:22,588 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1594782.0, ans=0.0 2023-06-26 15:05:25,818 INFO [train.py:996] (0/4) Epoch 9, batch 21850, loss[loss=0.2168, simple_loss=0.292, pruned_loss=0.07081, over 21764.00 frames. ], tot_loss[loss=0.2042, simple_loss=0.2805, pruned_loss=0.0639, over 4277735.31 frames. ], batch size: 112, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:06:43,089 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.19 vs. limit=15.0 2023-06-26 15:06:47,456 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1595022.0, ans=0.0 2023-06-26 15:07:12,431 INFO [train.py:996] (0/4) Epoch 9, batch 21900, loss[loss=0.1819, simple_loss=0.2519, pruned_loss=0.05597, over 21754.00 frames. ], tot_loss[loss=0.2047, simple_loss=0.2797, pruned_loss=0.06487, over 4265583.65 frames. ], batch size: 124, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:07:32,258 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.75 vs. limit=10.0 2023-06-26 15:07:38,242 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.466e+02 4.571e+02 6.004e+02 8.081e+02 1.811e+03, threshold=1.201e+03, percent-clipped=9.0 2023-06-26 15:07:52,222 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1595202.0, ans=0.0 2023-06-26 15:07:55,654 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1595262.0, ans=0.125 2023-06-26 15:08:37,797 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1595322.0, ans=0.2 2023-06-26 15:08:42,728 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1595382.0, ans=0.2 2023-06-26 15:08:51,489 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1595382.0, ans=0.0 2023-06-26 15:08:58,142 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1595442.0, ans=0.125 2023-06-26 15:09:04,629 INFO [train.py:996] (0/4) Epoch 9, batch 21950, loss[loss=0.1718, simple_loss=0.252, pruned_loss=0.04575, over 21891.00 frames. ], tot_loss[loss=0.2013, simple_loss=0.2747, pruned_loss=0.06392, over 4274297.60 frames. ], batch size: 373, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:09:14,123 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1595442.0, ans=0.2 2023-06-26 15:09:40,113 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1595502.0, ans=0.125 2023-06-26 15:10:54,239 INFO [train.py:996] (0/4) Epoch 9, batch 22000, loss[loss=0.1921, simple_loss=0.2609, pruned_loss=0.06165, over 21597.00 frames. 
], tot_loss[loss=0.196, simple_loss=0.2692, pruned_loss=0.06143, over 4271657.43 frames. ], batch size: 298, lr: 3.24e-03, grad_scale: 32.0 2023-06-26 15:11:15,760 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.199e+02 4.453e+02 7.165e+02 9.999e+02 1.931e+03, threshold=1.433e+03, percent-clipped=13.0 2023-06-26 15:11:28,031 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=1595802.0, ans=15.0 2023-06-26 15:12:22,027 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1595982.0, ans=0.125 2023-06-26 15:12:49,913 INFO [train.py:996] (0/4) Epoch 9, batch 22050, loss[loss=0.2776, simple_loss=0.3598, pruned_loss=0.09768, over 21622.00 frames. ], tot_loss[loss=0.2002, simple_loss=0.2745, pruned_loss=0.06289, over 4272142.39 frames. ], batch size: 441, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:13:09,335 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.55 vs. limit=15.0 2023-06-26 15:13:44,363 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1596162.0, ans=0.2 2023-06-26 15:13:44,421 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1596162.0, ans=0.0 2023-06-26 15:13:49,747 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.22 vs. limit=22.5 2023-06-26 15:14:06,648 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.93 vs. limit=15.0 2023-06-26 15:14:38,916 INFO [train.py:996] (0/4) Epoch 9, batch 22100, loss[loss=0.2313, simple_loss=0.3032, pruned_loss=0.07974, over 21332.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2851, pruned_loss=0.06714, over 4267407.10 frames. ], batch size: 159, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:14:48,494 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1596342.0, ans=0.1 2023-06-26 15:14:56,634 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.930e+02 6.282e+02 9.612e+02 1.455e+03 3.538e+03, threshold=1.922e+03, percent-clipped=29.0 2023-06-26 15:15:13,657 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1596402.0, ans=0.0 2023-06-26 15:15:37,863 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1596462.0, ans=0.0 2023-06-26 15:15:39,484 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1596462.0, ans=0.125 2023-06-26 15:16:14,191 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.79 vs. limit=15.0 2023-06-26 15:16:17,503 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.06 vs. limit=10.0 2023-06-26 15:16:26,359 INFO [train.py:996] (0/4) Epoch 9, batch 22150, loss[loss=0.1976, simple_loss=0.2779, pruned_loss=0.05864, over 21809.00 frames. 
], tot_loss[loss=0.2128, simple_loss=0.2885, pruned_loss=0.06854, over 4263042.12 frames. ], batch size: 102, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:16:30,121 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1596642.0, ans=0.125 2023-06-26 15:16:46,527 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1596702.0, ans=0.0 2023-06-26 15:17:03,996 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1596702.0, ans=0.1 2023-06-26 15:17:26,591 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 15:17:51,585 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.71 vs. limit=15.0 2023-06-26 15:18:14,966 INFO [train.py:996] (0/4) Epoch 9, batch 22200, loss[loss=0.2161, simple_loss=0.3069, pruned_loss=0.06261, over 21879.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2898, pruned_loss=0.06949, over 4277452.67 frames. ], batch size: 316, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:18:17,837 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1596942.0, ans=0.125 2023-06-26 15:18:32,782 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.802e+02 5.060e+02 7.082e+02 1.053e+03 2.242e+03, threshold=1.416e+03, percent-clipped=3.0 2023-06-26 15:18:49,167 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1597002.0, ans=0.0 2023-06-26 15:18:49,230 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 15:19:05,566 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.09 vs. limit=15.0 2023-06-26 15:19:20,343 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.32 vs. limit=10.0 2023-06-26 15:20:04,046 INFO [train.py:996] (0/4) Epoch 9, batch 22250, loss[loss=0.2502, simple_loss=0.3262, pruned_loss=0.08706, over 21205.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2965, pruned_loss=0.07121, over 4284916.37 frames. ], batch size: 143, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:20:58,020 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1597362.0, ans=0.125 2023-06-26 15:21:51,028 INFO [train.py:996] (0/4) Epoch 9, batch 22300, loss[loss=0.2092, simple_loss=0.3067, pruned_loss=0.0559, over 19930.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.2982, pruned_loss=0.07306, over 4288141.22 frames. 
], batch size: 702, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:22:08,320 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.561e+02 5.384e+02 7.516e+02 1.079e+03 3.010e+03, threshold=1.503e+03, percent-clipped=16.0 2023-06-26 15:22:38,712 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1597662.0, ans=0.0 2023-06-26 15:22:40,491 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1597662.0, ans=0.0 2023-06-26 15:23:33,869 INFO [train.py:996] (0/4) Epoch 9, batch 22350, loss[loss=0.2032, simple_loss=0.2738, pruned_loss=0.06628, over 21496.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.2972, pruned_loss=0.07312, over 4293946.07 frames. ], batch size: 212, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:23:59,528 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1597902.0, ans=0.125 2023-06-26 15:24:01,550 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.63 vs. limit=22.5 2023-06-26 15:24:25,185 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1597962.0, ans=0.125 2023-06-26 15:24:30,365 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1597962.0, ans=0.0 2023-06-26 15:24:39,429 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1598022.0, ans=0.125 2023-06-26 15:25:05,151 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1598082.0, ans=0.2 2023-06-26 15:25:21,636 INFO [train.py:996] (0/4) Epoch 9, batch 22400, loss[loss=0.2153, simple_loss=0.2785, pruned_loss=0.07601, over 20066.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2938, pruned_loss=0.07001, over 4292571.09 frames. ], batch size: 703, lr: 3.24e-03, grad_scale: 32.0 2023-06-26 15:25:49,395 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.723e+02 5.104e+02 6.690e+02 9.796e+02 2.008e+03, threshold=1.338e+03, percent-clipped=2.0 2023-06-26 15:26:10,489 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1598262.0, ans=0.2 2023-06-26 15:26:17,776 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1598262.0, ans=0.125 2023-06-26 15:26:21,239 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1598262.0, ans=0.1 2023-06-26 15:26:33,531 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.61 vs. limit=15.0 2023-06-26 15:26:53,892 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1598382.0, ans=0.07 2023-06-26 15:26:55,616 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1598382.0, ans=0.125 2023-06-26 15:27:14,433 INFO [train.py:996] (0/4) Epoch 9, batch 22450, loss[loss=0.1811, simple_loss=0.2413, pruned_loss=0.06049, over 21605.00 frames. 
], tot_loss[loss=0.2123, simple_loss=0.2879, pruned_loss=0.0684, over 4280471.12 frames. ], batch size: 231, lr: 3.24e-03, grad_scale: 32.0 2023-06-26 15:27:47,868 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1598502.0, ans=0.125 2023-06-26 15:27:54,641 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1598562.0, ans=0.125 2023-06-26 15:28:29,973 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1598622.0, ans=0.125 2023-06-26 15:28:32,282 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.06 vs. limit=15.0 2023-06-26 15:28:50,008 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.21 vs. limit=15.0 2023-06-26 15:29:02,867 INFO [train.py:996] (0/4) Epoch 9, batch 22500, loss[loss=0.2247, simple_loss=0.3139, pruned_loss=0.06776, over 21193.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.2815, pruned_loss=0.06767, over 4273752.99 frames. ], batch size: 176, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:29:17,669 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1598742.0, ans=0.125 2023-06-26 15:29:26,970 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.381e+02 5.166e+02 7.858e+02 1.138e+03 3.264e+03, threshold=1.572e+03, percent-clipped=12.0 2023-06-26 15:29:39,631 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1598802.0, ans=0.125 2023-06-26 15:29:54,248 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1598862.0, ans=0.0 2023-06-26 15:30:00,953 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1598862.0, ans=0.125 2023-06-26 15:30:03,093 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.42 vs. limit=15.0 2023-06-26 15:30:40,651 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1598982.0, ans=0.1 2023-06-26 15:30:57,536 INFO [train.py:996] (0/4) Epoch 9, batch 22550, loss[loss=0.2212, simple_loss=0.3188, pruned_loss=0.06181, over 20728.00 frames. ], tot_loss[loss=0.21, simple_loss=0.284, pruned_loss=0.06797, over 4277545.26 frames. ], batch size: 607, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:30:58,998 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.41 vs. 
limit=15.0 2023-06-26 15:31:28,266 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1599102.0, ans=0.0 2023-06-26 15:31:53,409 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1599162.0, ans=0.0 2023-06-26 15:32:18,809 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1599222.0, ans=0.125 2023-06-26 15:32:49,140 INFO [train.py:996] (0/4) Epoch 9, batch 22600, loss[loss=0.1457, simple_loss=0.1898, pruned_loss=0.05079, over 17076.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2881, pruned_loss=0.06931, over 4277251.08 frames. ], batch size: 65, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:33:08,791 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.130e+02 6.886e+02 1.082e+03 1.570e+03 3.521e+03, threshold=2.164e+03, percent-clipped=24.0 2023-06-26 15:33:38,945 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1599462.0, ans=0.0 2023-06-26 15:34:24,646 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1599582.0, ans=0.125 2023-06-26 15:34:33,111 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1599582.0, ans=0.125 2023-06-26 15:34:35,319 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.52 vs. limit=22.5 2023-06-26 15:34:37,817 INFO [train.py:996] (0/4) Epoch 9, batch 22650, loss[loss=0.2225, simple_loss=0.2975, pruned_loss=0.07369, over 21565.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2855, pruned_loss=0.06913, over 4276920.07 frames. ], batch size: 389, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:34:45,776 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.26 vs. limit=22.5 2023-06-26 15:34:57,357 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1599702.0, ans=0.125 2023-06-26 15:35:26,918 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.14 vs. limit=15.0 2023-06-26 15:35:56,470 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1599822.0, ans=0.1 2023-06-26 15:36:24,831 INFO [train.py:996] (0/4) Epoch 9, batch 22700, loss[loss=0.2112, simple_loss=0.2784, pruned_loss=0.07204, over 21845.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2797, pruned_loss=0.06829, over 4261188.66 frames. 
], batch size: 372, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:36:36,263 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1599942.0, ans=0.125 2023-06-26 15:36:36,345 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1599942.0, ans=0.0 2023-06-26 15:36:44,331 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.472e+02 5.506e+02 7.412e+02 1.059e+03 2.032e+03, threshold=1.482e+03, percent-clipped=0.0 2023-06-26 15:36:59,333 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1600002.0, ans=0.1 2023-06-26 15:37:20,947 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1600062.0, ans=0.025 2023-06-26 15:37:54,161 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1600122.0, ans=0.05 2023-06-26 15:38:13,918 INFO [train.py:996] (0/4) Epoch 9, batch 22750, loss[loss=0.2082, simple_loss=0.2756, pruned_loss=0.0704, over 20688.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.2794, pruned_loss=0.06871, over 4256980.64 frames. ], batch size: 607, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:38:23,582 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1600242.0, ans=0.125 2023-06-26 15:38:50,291 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1600302.0, ans=0.125 2023-06-26 15:39:17,821 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1600362.0, ans=0.125 2023-06-26 15:40:01,402 INFO [train.py:996] (0/4) Epoch 9, batch 22800, loss[loss=0.2311, simple_loss=0.3013, pruned_loss=0.08044, over 21235.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2851, pruned_loss=0.07175, over 4270116.44 frames. ], batch size: 143, lr: 3.24e-03, grad_scale: 32.0 2023-06-26 15:40:25,809 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.02 vs. 
limit=15.0 2023-06-26 15:40:28,036 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.571e+02 5.339e+02 7.756e+02 1.140e+03 2.355e+03, threshold=1.551e+03, percent-clipped=14.0 2023-06-26 15:40:33,546 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1600602.0, ans=0.125 2023-06-26 15:40:45,444 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1600662.0, ans=0.125 2023-06-26 15:41:03,231 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1600662.0, ans=0.0 2023-06-26 15:41:25,357 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1600722.0, ans=0.0 2023-06-26 15:41:27,178 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1600722.0, ans=0.125 2023-06-26 15:41:34,535 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1600782.0, ans=0.125 2023-06-26 15:41:49,514 INFO [train.py:996] (0/4) Epoch 9, batch 22850, loss[loss=0.2064, simple_loss=0.2855, pruned_loss=0.06365, over 21427.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2829, pruned_loss=0.07122, over 4272193.01 frames. ], batch size: 131, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:42:09,517 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.88 vs. limit=8.0 2023-06-26 15:42:30,454 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.54 vs. limit=15.0 2023-06-26 15:43:37,590 INFO [train.py:996] (0/4) Epoch 9, batch 22900, loss[loss=0.2158, simple_loss=0.3196, pruned_loss=0.05601, over 21809.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.283, pruned_loss=0.07023, over 4269508.34 frames. ], batch size: 282, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:44:04,408 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.622e+02 6.448e+02 8.997e+02 1.321e+03 2.993e+03, threshold=1.799e+03, percent-clipped=19.0 2023-06-26 15:44:05,847 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.34 vs. limit=22.5 2023-06-26 15:44:07,493 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1601202.0, ans=0.125 2023-06-26 15:44:19,220 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1601202.0, ans=0.05 2023-06-26 15:44:28,588 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1601262.0, ans=0.0 2023-06-26 15:44:32,813 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.40 vs. limit=22.5 2023-06-26 15:44:49,485 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.89 vs. 
limit=15.0 2023-06-26 15:45:06,450 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1601322.0, ans=0.0 2023-06-26 15:45:28,338 INFO [train.py:996] (0/4) Epoch 9, batch 22950, loss[loss=0.2096, simple_loss=0.3416, pruned_loss=0.03879, over 20761.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2992, pruned_loss=0.07021, over 4270580.09 frames. ], batch size: 607, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:45:30,527 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1601442.0, ans=0.1 2023-06-26 15:45:34,308 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1601442.0, ans=0.0 2023-06-26 15:46:13,412 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1601562.0, ans=0.125 2023-06-26 15:46:33,370 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1601562.0, ans=0.0 2023-06-26 15:47:10,759 INFO [train.py:996] (0/4) Epoch 9, batch 23000, loss[loss=0.211, simple_loss=0.2857, pruned_loss=0.06815, over 21912.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.3001, pruned_loss=0.0682, over 4268788.33 frames. ], batch size: 316, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:47:42,632 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.591e+02 4.526e+02 6.178e+02 9.113e+02 2.510e+03, threshold=1.236e+03, percent-clipped=4.0 2023-06-26 15:48:24,425 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1601922.0, ans=0.2 2023-06-26 15:48:53,915 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=19.85 vs. limit=22.5 2023-06-26 15:48:55,173 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1601982.0, ans=0.2 2023-06-26 15:49:11,855 INFO [train.py:996] (0/4) Epoch 9, batch 23050, loss[loss=0.2272, simple_loss=0.3011, pruned_loss=0.0767, over 21232.00 frames. ], tot_loss[loss=0.22, simple_loss=0.3006, pruned_loss=0.0697, over 4273813.93 frames. ], batch size: 176, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:50:55,765 INFO [train.py:996] (0/4) Epoch 9, batch 23100, loss[loss=0.1984, simple_loss=0.26, pruned_loss=0.06845, over 21391.00 frames. ], tot_loss[loss=0.218, simple_loss=0.296, pruned_loss=0.06997, over 4271334.12 frames. 
], batch size: 131, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:50:58,158 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1602342.0, ans=0.125 2023-06-26 15:51:03,265 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1602342.0, ans=0.0 2023-06-26 15:51:03,828 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=1602342.0, ans=15.0 2023-06-26 15:51:22,043 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.578e+02 4.836e+02 6.120e+02 9.547e+02 2.287e+03, threshold=1.224e+03, percent-clipped=14.0 2023-06-26 15:51:31,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1602402.0, ans=0.125 2023-06-26 15:52:44,512 INFO [train.py:996] (0/4) Epoch 9, batch 23150, loss[loss=0.1984, simple_loss=0.2737, pruned_loss=0.06154, over 21927.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2893, pruned_loss=0.06954, over 4270627.72 frames. ], batch size: 316, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:53:45,574 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1602822.0, ans=0.125 2023-06-26 15:53:57,490 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 15:54:25,260 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.28 vs. limit=15.0 2023-06-26 15:54:25,646 INFO [train.py:996] (0/4) Epoch 9, batch 23200, loss[loss=0.1976, simple_loss=0.2722, pruned_loss=0.06143, over 21675.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2886, pruned_loss=0.07039, over 4276118.07 frames. ], batch size: 263, lr: 3.24e-03, grad_scale: 32.0 2023-06-26 15:54:57,772 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.454e+02 5.096e+02 6.731e+02 1.055e+03 2.311e+03, threshold=1.346e+03, percent-clipped=14.0 2023-06-26 15:55:05,011 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1603002.0, ans=0.125 2023-06-26 15:55:08,675 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1603002.0, ans=0.125 2023-06-26 15:55:46,393 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1603122.0, ans=0.1 2023-06-26 15:55:48,267 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1603122.0, ans=0.0 2023-06-26 15:55:50,236 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1603182.0, ans=0.0 2023-06-26 15:56:14,205 INFO [train.py:996] (0/4) Epoch 9, batch 23250, loss[loss=0.2057, simple_loss=0.2711, pruned_loss=0.07018, over 21497.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2873, pruned_loss=0.07083, over 4280927.37 frames. 
], batch size: 194, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:57:37,079 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1603422.0, ans=0.2 2023-06-26 15:58:08,902 INFO [train.py:996] (0/4) Epoch 9, batch 23300, loss[loss=0.2543, simple_loss=0.3321, pruned_loss=0.0882, over 21336.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2952, pruned_loss=0.07251, over 4273721.62 frames. ], batch size: 548, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:58:33,339 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1603602.0, ans=0.125 2023-06-26 15:58:37,852 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.694e+02 6.000e+02 9.033e+02 1.405e+03 3.617e+03, threshold=1.807e+03, percent-clipped=26.0 2023-06-26 15:59:10,120 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1603662.0, ans=0.125 2023-06-26 16:00:05,627 INFO [train.py:996] (0/4) Epoch 9, batch 23350, loss[loss=0.2315, simple_loss=0.3296, pruned_loss=0.06669, over 20711.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.3006, pruned_loss=0.07185, over 4265510.24 frames. ], batch size: 607, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 16:00:28,798 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1603902.0, ans=0.0 2023-06-26 16:01:02,013 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1604022.0, ans=0.0 2023-06-26 16:01:05,770 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=9.95 vs. limit=15.0 2023-06-26 16:01:26,095 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1604022.0, ans=0.1 2023-06-26 16:01:50,741 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1604082.0, ans=0.1 2023-06-26 16:01:52,569 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1604142.0, ans=0.0 2023-06-26 16:01:53,520 INFO [train.py:996] (0/4) Epoch 9, batch 23400, loss[loss=0.2089, simple_loss=0.2784, pruned_loss=0.0697, over 21772.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2926, pruned_loss=0.06812, over 4267973.10 frames. ], batch size: 247, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:02:15,703 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1604202.0, ans=0.0 2023-06-26 16:02:21,754 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.225e+02 5.479e+02 7.119e+02 1.024e+03 2.077e+03, threshold=1.424e+03, percent-clipped=2.0 2023-06-26 16:03:09,693 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1604322.0, ans=0.5 2023-06-26 16:03:21,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1604322.0, ans=0.0 2023-06-26 16:03:47,226 INFO [train.py:996] (0/4) Epoch 9, batch 23450, loss[loss=0.2166, simple_loss=0.2854, pruned_loss=0.07393, over 20771.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2924, pruned_loss=0.06977, over 4268765.56 frames. 
], batch size: 608, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:03:57,814 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1604442.0, ans=0.2 2023-06-26 16:04:14,223 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1604502.0, ans=0.125 2023-06-26 16:05:28,897 INFO [train.py:996] (0/4) Epoch 9, batch 23500, loss[loss=0.2117, simple_loss=0.2817, pruned_loss=0.07083, over 21884.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2932, pruned_loss=0.07188, over 4278169.82 frames. ], batch size: 371, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:05:53,520 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1604802.0, ans=0.125 2023-06-26 16:05:56,211 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.762e+02 6.548e+02 9.049e+02 1.310e+03 3.325e+03, threshold=1.810e+03, percent-clipped=21.0 2023-06-26 16:06:16,123 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.59 vs. limit=15.0 2023-06-26 16:07:15,845 INFO [train.py:996] (0/4) Epoch 9, batch 23550, loss[loss=0.1919, simple_loss=0.2497, pruned_loss=0.06706, over 21532.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2887, pruned_loss=0.0717, over 4273729.48 frames. ], batch size: 195, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:07:19,929 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1605042.0, ans=0.125 2023-06-26 16:07:36,653 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1605102.0, ans=0.0 2023-06-26 16:07:38,890 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.94 vs. limit=15.0 2023-06-26 16:07:47,184 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1605102.0, ans=0.0 2023-06-26 16:07:58,885 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.58 vs. limit=22.5 2023-06-26 16:08:22,745 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1605222.0, ans=0.125 2023-06-26 16:08:22,775 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1605222.0, ans=0.0 2023-06-26 16:08:22,807 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1605222.0, ans=0.125 2023-06-26 16:08:38,861 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1605222.0, ans=0.125 2023-06-26 16:08:38,881 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1605222.0, ans=0.0 2023-06-26 16:08:44,583 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.99 vs. limit=5.0 2023-06-26 16:09:04,150 INFO [train.py:996] (0/4) Epoch 9, batch 23600, loss[loss=0.2051, simple_loss=0.2865, pruned_loss=0.06183, over 21662.00 frames. 
], tot_loss[loss=0.2162, simple_loss=0.2894, pruned_loss=0.07154, over 4262044.14 frames. ], batch size: 298, lr: 3.23e-03, grad_scale: 32.0 2023-06-26 16:09:32,654 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.846e+02 5.348e+02 7.374e+02 1.134e+03 2.536e+03, threshold=1.475e+03, percent-clipped=3.0 2023-06-26 16:10:55,359 INFO [train.py:996] (0/4) Epoch 9, batch 23650, loss[loss=0.2383, simple_loss=0.3164, pruned_loss=0.08008, over 21377.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2897, pruned_loss=0.07051, over 4259658.27 frames. ], batch size: 143, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:11:10,039 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1605642.0, ans=0.125 2023-06-26 16:11:10,045 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1605642.0, ans=0.2 2023-06-26 16:11:27,244 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1605702.0, ans=0.0 2023-06-26 16:11:27,940 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.90 vs. limit=15.0 2023-06-26 16:11:35,881 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1605702.0, ans=0.125 2023-06-26 16:11:48,090 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1605762.0, ans=0.2 2023-06-26 16:12:23,017 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff2.min_abs, batch_count=1605822.0, ans=0.1 2023-06-26 16:12:30,209 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1605882.0, ans=0.0 2023-06-26 16:12:43,751 INFO [train.py:996] (0/4) Epoch 9, batch 23700, loss[loss=0.229, simple_loss=0.3003, pruned_loss=0.07879, over 21203.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2925, pruned_loss=0.07007, over 4263946.29 frames. ], batch size: 143, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:13:08,969 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1606002.0, ans=0.2 2023-06-26 16:13:18,789 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.291e+02 4.703e+02 6.208e+02 8.925e+02 2.253e+03, threshold=1.242e+03, percent-clipped=5.0 2023-06-26 16:13:29,792 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1606002.0, ans=0.125 2023-06-26 16:13:49,639 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1606062.0, ans=0.125 2023-06-26 16:14:33,485 INFO [train.py:996] (0/4) Epoch 9, batch 23750, loss[loss=0.202, simple_loss=0.2743, pruned_loss=0.0648, over 20157.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2956, pruned_loss=0.07065, over 4264441.72 frames. 
], batch size: 702, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:15:14,747 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1606302.0, ans=0.125 2023-06-26 16:15:14,870 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1606302.0, ans=0.025 2023-06-26 16:15:49,679 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1606422.0, ans=0.2 2023-06-26 16:16:27,421 INFO [train.py:996] (0/4) Epoch 9, batch 23800, loss[loss=0.2454, simple_loss=0.3247, pruned_loss=0.08306, over 21407.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2938, pruned_loss=0.06844, over 4273364.28 frames. ], batch size: 471, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:16:43,112 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.93 vs. limit=10.0 2023-06-26 16:16:51,733 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1606542.0, ans=0.125 2023-06-26 16:17:03,066 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1606602.0, ans=0.025 2023-06-26 16:17:04,095 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.330e+02 5.235e+02 7.849e+02 1.092e+03 2.188e+03, threshold=1.570e+03, percent-clipped=19.0 2023-06-26 16:17:08,093 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1606602.0, ans=0.1 2023-06-26 16:17:15,635 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1606662.0, ans=0.0 2023-06-26 16:17:37,825 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1606722.0, ans=0.1 2023-06-26 16:17:37,931 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1606722.0, ans=0.025 2023-06-26 16:17:45,336 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1606722.0, ans=0.0 2023-06-26 16:17:47,775 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.48 vs. limit=15.0 2023-06-26 16:18:29,061 INFO [train.py:996] (0/4) Epoch 9, batch 23850, loss[loss=0.2203, simple_loss=0.3052, pruned_loss=0.06771, over 21494.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.3031, pruned_loss=0.07132, over 4273268.95 frames. 
], batch size: 194, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:18:58,791 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1606902.0, ans=0.125 2023-06-26 16:19:16,793 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1606962.0, ans=0.1 2023-06-26 16:19:25,529 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1607022.0, ans=0.2 2023-06-26 16:20:06,900 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1607082.0, ans=0.1 2023-06-26 16:20:16,508 INFO [train.py:996] (0/4) Epoch 9, batch 23900, loss[loss=0.2086, simple_loss=0.2943, pruned_loss=0.06147, over 21558.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3102, pruned_loss=0.07302, over 4269357.33 frames. ], batch size: 263, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:20:45,484 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.978e+02 7.206e+02 9.928e+02 1.468e+03 4.059e+03, threshold=1.986e+03, percent-clipped=20.0 2023-06-26 16:20:53,239 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.45 vs. limit=15.0 2023-06-26 16:21:35,493 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1607322.0, ans=0.125 2023-06-26 16:21:50,596 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1607382.0, ans=0.125 2023-06-26 16:22:02,582 INFO [train.py:996] (0/4) Epoch 9, batch 23950, loss[loss=0.2543, simple_loss=0.314, pruned_loss=0.09728, over 21618.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.3034, pruned_loss=0.07252, over 4272856.96 frames. ], batch size: 441, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:22:04,531 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1607442.0, ans=0.0 2023-06-26 16:23:12,732 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1607622.0, ans=0.125 2023-06-26 16:23:18,629 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.81 vs. limit=15.0 2023-06-26 16:23:45,227 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=5.77 vs. limit=22.5 2023-06-26 16:23:50,353 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.42 vs. limit=15.0 2023-06-26 16:23:50,748 INFO [train.py:996] (0/4) Epoch 9, batch 24000, loss[loss=0.2606, simple_loss=0.335, pruned_loss=0.09311, over 21815.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3051, pruned_loss=0.07549, over 4281339.62 frames. ], batch size: 441, lr: 3.23e-03, grad_scale: 32.0 2023-06-26 16:23:50,749 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-26 16:24:10,700 INFO [train.py:1028] (0/4) Epoch 9, validation: loss=0.2632, simple_loss=0.3589, pruned_loss=0.0837, over 1796401.00 frames. 
2023-06-26 16:24:10,701 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23714MB 2023-06-26 16:24:36,340 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.989e+02 5.715e+02 7.802e+02 1.213e+03 2.324e+03, threshold=1.560e+03, percent-clipped=4.0 2023-06-26 16:24:48,475 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.59 vs. limit=15.0 2023-06-26 16:25:21,049 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1607922.0, ans=0.0 2023-06-26 16:25:41,966 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-268000.pt 2023-06-26 16:26:00,889 INFO [train.py:996] (0/4) Epoch 9, batch 24050, loss[loss=0.2128, simple_loss=0.2895, pruned_loss=0.06799, over 20253.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3065, pruned_loss=0.0758, over 4275282.89 frames. ], batch size: 703, lr: 3.23e-03, grad_scale: 32.0 2023-06-26 16:26:17,992 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.84 vs. limit=6.0 2023-06-26 16:27:06,625 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1608162.0, ans=0.0 2023-06-26 16:27:23,182 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.20 vs. limit=15.0 2023-06-26 16:27:34,761 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1608282.0, ans=0.2 2023-06-26 16:27:49,864 INFO [train.py:996] (0/4) Epoch 9, batch 24100, loss[loss=0.23, simple_loss=0.2994, pruned_loss=0.08034, over 20064.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.3068, pruned_loss=0.07425, over 4272311.36 frames. ], batch size: 702, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:28:01,665 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 16:28:27,569 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.963e+02 5.200e+02 7.145e+02 1.046e+03 2.381e+03, threshold=1.429e+03, percent-clipped=3.0 2023-06-26 16:28:29,715 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1608402.0, ans=0.125 2023-06-26 16:28:31,382 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1608402.0, ans=0.0 2023-06-26 16:29:13,674 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1608522.0, ans=0.1 2023-06-26 16:29:32,685 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1608582.0, ans=0.1 2023-06-26 16:29:39,192 INFO [train.py:996] (0/4) Epoch 9, batch 24150, loss[loss=0.2316, simple_loss=0.2982, pruned_loss=0.08245, over 21727.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3058, pruned_loss=0.07563, over 4281402.27 frames. ], batch size: 389, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:30:30,810 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.94 vs. 
limit=15.0 2023-06-26 16:31:29,818 INFO [train.py:996] (0/4) Epoch 9, batch 24200, loss[loss=0.2208, simple_loss=0.2895, pruned_loss=0.07604, over 21262.00 frames. ], tot_loss[loss=0.233, simple_loss=0.31, pruned_loss=0.07805, over 4287344.45 frames. ], batch size: 159, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:32:12,921 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.636e+02 5.699e+02 8.049e+02 1.259e+03 2.421e+03, threshold=1.610e+03, percent-clipped=17.0 2023-06-26 16:32:36,608 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1609062.0, ans=0.0 2023-06-26 16:33:31,028 INFO [train.py:996] (0/4) Epoch 9, batch 24250, loss[loss=0.1672, simple_loss=0.2536, pruned_loss=0.0404, over 21302.00 frames. ], tot_loss[loss=0.225, simple_loss=0.3063, pruned_loss=0.07185, over 4290633.34 frames. ], batch size: 143, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:34:45,490 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.45 vs. limit=8.0 2023-06-26 16:35:18,868 INFO [train.py:996] (0/4) Epoch 9, batch 24300, loss[loss=0.1621, simple_loss=0.2459, pruned_loss=0.03917, over 21715.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2988, pruned_loss=0.0666, over 4280560.21 frames. ], batch size: 298, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:35:50,238 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.136e+02 4.084e+02 7.233e+02 1.324e+03 4.143e+03, threshold=1.447e+03, percent-clipped=16.0 2023-06-26 16:36:52,566 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.00 vs. limit=15.0 2023-06-26 16:37:07,377 INFO [train.py:996] (0/4) Epoch 9, batch 24350, loss[loss=0.1856, simple_loss=0.2406, pruned_loss=0.06529, over 20195.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2946, pruned_loss=0.06618, over 4281779.68 frames. ], batch size: 702, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:38:06,929 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1609962.0, ans=0.125 2023-06-26 16:38:25,812 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1610022.0, ans=0.0 2023-06-26 16:39:02,383 INFO [train.py:996] (0/4) Epoch 9, batch 24400, loss[loss=0.2209, simple_loss=0.2927, pruned_loss=0.07459, over 21472.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2982, pruned_loss=0.06894, over 4282911.46 frames. ], batch size: 194, lr: 3.23e-03, grad_scale: 32.0 2023-06-26 16:39:05,094 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1610142.0, ans=0.125 2023-06-26 16:39:17,549 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.90 vs. limit=10.0 2023-06-26 16:39:34,021 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.576e+02 5.148e+02 6.716e+02 1.029e+03 2.743e+03, threshold=1.343e+03, percent-clipped=7.0 2023-06-26 16:39:42,255 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 16:40:09,559 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.81 vs. 
limit=12.0 2023-06-26 16:40:30,004 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1610382.0, ans=0.125 2023-06-26 16:40:37,865 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1610382.0, ans=0.2 2023-06-26 16:40:52,862 INFO [train.py:996] (0/4) Epoch 9, batch 24450, loss[loss=0.2073, simple_loss=0.2983, pruned_loss=0.05815, over 21644.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.3018, pruned_loss=0.07002, over 4274916.22 frames. ], batch size: 247, lr: 3.23e-03, grad_scale: 32.0 2023-06-26 16:41:38,681 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1610562.0, ans=0.125 2023-06-26 16:42:41,539 INFO [train.py:996] (0/4) Epoch 9, batch 24500, loss[loss=0.1974, simple_loss=0.2803, pruned_loss=0.05723, over 21368.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.3029, pruned_loss=0.07046, over 4278316.80 frames. ], batch size: 194, lr: 3.23e-03, grad_scale: 32.0 2023-06-26 16:42:59,783 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1610742.0, ans=0.0 2023-06-26 16:43:14,694 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.443e+02 5.135e+02 6.610e+02 1.095e+03 2.710e+03, threshold=1.322e+03, percent-clipped=12.0 2023-06-26 16:43:15,298 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1610802.0, ans=0.025 2023-06-26 16:43:33,749 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1610862.0, ans=0.1 2023-06-26 16:43:49,561 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1610922.0, ans=0.125 2023-06-26 16:44:35,193 INFO [train.py:996] (0/4) Epoch 9, batch 24550, loss[loss=0.2828, simple_loss=0.3564, pruned_loss=0.1046, over 21213.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.3038, pruned_loss=0.07198, over 4278997.69 frames. ], batch size: 143, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:44:44,769 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1611042.0, ans=0.125 2023-06-26 16:44:58,359 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1611102.0, ans=0.0 2023-06-26 16:45:02,055 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.72 vs. limit=15.0 2023-06-26 16:45:45,264 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1611222.0, ans=0.0 2023-06-26 16:46:09,814 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.82 vs. limit=10.0 2023-06-26 16:46:16,622 INFO [train.py:996] (0/4) Epoch 9, batch 24600, loss[loss=0.1905, simple_loss=0.2624, pruned_loss=0.05926, over 21700.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.2999, pruned_loss=0.07274, over 4271048.67 frames. 
], batch size: 282, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:46:30,677 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1611342.0, ans=0.1 2023-06-26 16:46:48,954 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.731e+02 5.495e+02 6.731e+02 9.246e+02 1.741e+03, threshold=1.346e+03, percent-clipped=6.0 2023-06-26 16:47:25,606 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.93 vs. limit=10.0 2023-06-26 16:48:05,286 INFO [train.py:996] (0/4) Epoch 9, batch 24650, loss[loss=0.1799, simple_loss=0.248, pruned_loss=0.05586, over 21769.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2912, pruned_loss=0.07168, over 4268632.90 frames. ], batch size: 300, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:48:16,097 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1611642.0, ans=0.125 2023-06-26 16:48:25,149 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1611642.0, ans=0.125 2023-06-26 16:49:30,478 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.70 vs. limit=15.0 2023-06-26 16:49:58,456 INFO [train.py:996] (0/4) Epoch 9, batch 24700, loss[loss=0.2038, simple_loss=0.2773, pruned_loss=0.06518, over 21784.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2885, pruned_loss=0.0694, over 4271946.88 frames. ], batch size: 112, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:50:16,593 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1612002.0, ans=0.0 2023-06-26 16:50:31,842 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.355e+02 4.942e+02 6.984e+02 9.406e+02 2.267e+03, threshold=1.397e+03, percent-clipped=8.0 2023-06-26 16:50:52,811 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1612062.0, ans=0.1 2023-06-26 16:51:06,140 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1612122.0, ans=0.125 2023-06-26 16:51:46,656 INFO [train.py:996] (0/4) Epoch 9, batch 24750, loss[loss=0.235, simple_loss=0.285, pruned_loss=0.09246, over 21418.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2828, pruned_loss=0.06757, over 4269382.65 frames. ], batch size: 509, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:52:01,254 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1612242.0, ans=0.0 2023-06-26 16:52:07,775 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.41 vs. 
limit=15.0 2023-06-26 16:52:21,504 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1612302.0, ans=0.0 2023-06-26 16:52:35,583 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1612362.0, ans=0.2 2023-06-26 16:52:38,739 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1612362.0, ans=0.035 2023-06-26 16:53:06,900 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.16 vs. limit=15.0 2023-06-26 16:53:25,219 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1612482.0, ans=0.0 2023-06-26 16:53:29,880 INFO [train.py:996] (0/4) Epoch 9, batch 24800, loss[loss=0.2341, simple_loss=0.2792, pruned_loss=0.0945, over 21627.00 frames. ], tot_loss[loss=0.206, simple_loss=0.2782, pruned_loss=0.06688, over 4274476.74 frames. ], batch size: 508, lr: 3.23e-03, grad_scale: 32.0 2023-06-26 16:53:42,603 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1612542.0, ans=0.125 2023-06-26 16:54:10,074 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.499e+02 5.335e+02 8.218e+02 1.489e+03 3.682e+03, threshold=1.644e+03, percent-clipped=29.0 2023-06-26 16:54:24,412 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1612662.0, ans=0.125 2023-06-26 16:54:28,305 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.80 vs. limit=15.0 2023-06-26 16:55:15,575 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1612782.0, ans=0.125 2023-06-26 16:55:17,116 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1612782.0, ans=0.125 2023-06-26 16:55:20,287 INFO [train.py:996] (0/4) Epoch 9, batch 24850, loss[loss=0.2093, simple_loss=0.2881, pruned_loss=0.06531, over 21057.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2803, pruned_loss=0.06882, over 4282055.41 frames. ], batch size: 608, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:55:22,519 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1612842.0, ans=0.025 2023-06-26 16:55:43,741 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1612902.0, ans=0.0 2023-06-26 16:56:08,500 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=3.847e-03 2023-06-26 16:56:13,506 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1612962.0, ans=0.125 2023-06-26 16:57:11,754 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1613082.0, ans=0.035 2023-06-26 16:57:14,614 INFO [train.py:996] (0/4) Epoch 9, batch 24900, loss[loss=0.2249, simple_loss=0.2988, pruned_loss=0.07554, over 21822.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2818, pruned_loss=0.06932, over 4283978.23 frames. 
], batch size: 247, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:57:54,887 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.038e+02 5.408e+02 8.463e+02 1.347e+03 2.375e+03, threshold=1.693e+03, percent-clipped=14.0 2023-06-26 16:58:32,447 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1613322.0, ans=0.2 2023-06-26 16:59:00,718 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1613382.0, ans=0.2 2023-06-26 16:59:11,121 INFO [train.py:996] (0/4) Epoch 9, batch 24950, loss[loss=0.1845, simple_loss=0.2353, pruned_loss=0.06687, over 20298.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2891, pruned_loss=0.07284, over 4285527.52 frames. ], batch size: 703, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:59:34,939 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.33 vs. limit=15.0 2023-06-26 17:00:32,913 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1613622.0, ans=0.1 2023-06-26 17:00:50,397 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1613682.0, ans=0.0 2023-06-26 17:01:05,686 INFO [train.py:996] (0/4) Epoch 9, batch 25000, loss[loss=0.1992, simple_loss=0.2806, pruned_loss=0.05895, over 21839.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.2967, pruned_loss=0.07533, over 4281055.98 frames. ], batch size: 118, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 17:01:13,695 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.77 vs. limit=15.0 2023-06-26 17:01:40,152 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.080e+02 5.287e+02 8.385e+02 1.349e+03 3.356e+03, threshold=1.677e+03, percent-clipped=10.0 2023-06-26 17:01:42,553 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1613802.0, ans=0.125 2023-06-26 17:01:58,363 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 17:02:44,574 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1613982.0, ans=0.125 2023-06-26 17:02:51,857 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1614042.0, ans=0.1 2023-06-26 17:02:52,725 INFO [train.py:996] (0/4) Epoch 9, batch 25050, loss[loss=0.2021, simple_loss=0.2502, pruned_loss=0.077, over 20295.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2897, pruned_loss=0.07384, over 4272282.35 frames. ], batch size: 703, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:03:53,017 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1614162.0, ans=0.125 2023-06-26 17:04:08,553 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1614222.0, ans=0.2 2023-06-26 17:04:30,391 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.12 vs. 
limit=10.0 2023-06-26 17:04:40,832 INFO [train.py:996] (0/4) Epoch 9, batch 25100, loss[loss=0.2089, simple_loss=0.3068, pruned_loss=0.05548, over 21259.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2844, pruned_loss=0.07229, over 4272193.58 frames. ], batch size: 548, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:04:46,350 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1614342.0, ans=0.125 2023-06-26 17:05:12,759 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1614402.0, ans=0.0 2023-06-26 17:05:15,431 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.352e+02 5.789e+02 8.437e+02 1.364e+03 2.592e+03, threshold=1.687e+03, percent-clipped=13.0 2023-06-26 17:05:36,782 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1614522.0, ans=0.125 2023-06-26 17:05:47,604 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.91 vs. limit=15.0 2023-06-26 17:06:16,739 INFO [train.py:996] (0/4) Epoch 9, batch 25150, loss[loss=0.2741, simple_loss=0.3369, pruned_loss=0.1057, over 21709.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2887, pruned_loss=0.0706, over 4271970.72 frames. ], batch size: 508, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:06:43,451 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1614702.0, ans=0.125 2023-06-26 17:06:53,825 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1614702.0, ans=0.0 2023-06-26 17:07:12,515 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.71 vs. limit=6.0 2023-06-26 17:07:59,213 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.49 vs. limit=15.0 2023-06-26 17:08:05,177 INFO [train.py:996] (0/4) Epoch 9, batch 25200, loss[loss=0.1932, simple_loss=0.2864, pruned_loss=0.05, over 21588.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2885, pruned_loss=0.06838, over 4264818.41 frames. 
], batch size: 263, lr: 3.22e-03, grad_scale: 32.0 2023-06-26 17:08:11,172 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1614942.0, ans=0.0 2023-06-26 17:08:33,783 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1615002.0, ans=0.125 2023-06-26 17:08:35,561 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1615002.0, ans=0.125 2023-06-26 17:08:42,269 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1615002.0, ans=0.0 2023-06-26 17:08:50,508 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.256e+02 4.732e+02 7.162e+02 1.048e+03 3.410e+03, threshold=1.432e+03, percent-clipped=11.0 2023-06-26 17:09:09,883 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1615062.0, ans=0.1 2023-06-26 17:09:52,309 INFO [train.py:996] (0/4) Epoch 9, batch 25250, loss[loss=0.2018, simple_loss=0.2692, pruned_loss=0.06717, over 21199.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2854, pruned_loss=0.06607, over 4257626.26 frames. ], batch size: 159, lr: 3.22e-03, grad_scale: 32.0 2023-06-26 17:10:01,009 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1615242.0, ans=0.1 2023-06-26 17:10:35,101 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1615362.0, ans=0.2 2023-06-26 17:10:58,098 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1615422.0, ans=0.125 2023-06-26 17:11:08,084 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.34 vs. limit=6.0 2023-06-26 17:11:39,535 INFO [train.py:996] (0/4) Epoch 9, batch 25300, loss[loss=0.214, simple_loss=0.302, pruned_loss=0.06304, over 21321.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.2826, pruned_loss=0.06549, over 4239150.49 frames. ], batch size: 548, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:12:16,251 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1615602.0, ans=0.0 2023-06-26 17:12:22,273 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.284e+02 5.811e+02 7.982e+02 1.248e+03 2.930e+03, threshold=1.596e+03, percent-clipped=17.0 2023-06-26 17:13:11,505 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1615782.0, ans=0.0 2023-06-26 17:13:19,711 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1615782.0, ans=0.0 2023-06-26 17:13:29,744 INFO [train.py:996] (0/4) Epoch 9, batch 25350, loss[loss=0.1664, simple_loss=0.2536, pruned_loss=0.03963, over 21591.00 frames. ], tot_loss[loss=0.2064, simple_loss=0.283, pruned_loss=0.06484, over 4230709.51 frames. 
], batch size: 230, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:14:38,978 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1616022.0, ans=0.2 2023-06-26 17:15:00,662 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1616082.0, ans=0.0 2023-06-26 17:15:17,093 INFO [train.py:996] (0/4) Epoch 9, batch 25400, loss[loss=0.1954, simple_loss=0.2626, pruned_loss=0.0641, over 21523.00 frames. ], tot_loss[loss=0.2035, simple_loss=0.2793, pruned_loss=0.06383, over 4227932.47 frames. ], batch size: 441, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:15:52,432 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1616202.0, ans=0.04949747468305833 2023-06-26 17:15:58,570 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.261e+02 5.059e+02 8.454e+02 1.158e+03 2.444e+03, threshold=1.691e+03, percent-clipped=8.0 2023-06-26 17:16:17,598 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1616262.0, ans=0.125 2023-06-26 17:16:46,900 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1616382.0, ans=0.0 2023-06-26 17:17:05,794 INFO [train.py:996] (0/4) Epoch 9, batch 25450, loss[loss=0.227, simple_loss=0.3065, pruned_loss=0.07372, over 21284.00 frames. ], tot_loss[loss=0.2056, simple_loss=0.2802, pruned_loss=0.06553, over 4237548.68 frames. ], batch size: 143, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:17:22,336 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1616442.0, ans=0.0 2023-06-26 17:17:35,076 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1616502.0, ans=0.125 2023-06-26 17:17:53,455 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1616502.0, ans=0.0 2023-06-26 17:18:21,372 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.26 vs. limit=15.0 2023-06-26 17:18:54,625 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.91 vs. limit=15.0 2023-06-26 17:18:55,042 INFO [train.py:996] (0/4) Epoch 9, batch 25500, loss[loss=0.1697, simple_loss=0.2586, pruned_loss=0.04038, over 21644.00 frames. ], tot_loss[loss=0.2017, simple_loss=0.2794, pruned_loss=0.06202, over 4242847.65 frames. ], batch size: 230, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:19:43,310 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.118e+02 5.222e+02 7.710e+02 1.108e+03 2.263e+03, threshold=1.542e+03, percent-clipped=6.0 2023-06-26 17:20:21,984 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.54 vs. limit=22.5 2023-06-26 17:20:48,929 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.77 vs. limit=15.0 2023-06-26 17:20:56,307 INFO [train.py:996] (0/4) Epoch 9, batch 25550, loss[loss=0.1761, simple_loss=0.248, pruned_loss=0.05213, over 15897.00 frames. 
], tot_loss[loss=0.206, simple_loss=0.2866, pruned_loss=0.06267, over 4242779.92 frames. ], batch size: 60, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:20:56,945 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1617042.0, ans=0.125 2023-06-26 17:21:26,640 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1617102.0, ans=0.09899494936611666 2023-06-26 17:21:56,253 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1617162.0, ans=0.1 2023-06-26 17:22:46,559 INFO [train.py:996] (0/4) Epoch 9, batch 25600, loss[loss=0.271, simple_loss=0.3476, pruned_loss=0.09718, over 21222.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2911, pruned_loss=0.0638, over 4257826.30 frames. ], batch size: 143, lr: 3.22e-03, grad_scale: 32.0 2023-06-26 17:23:05,381 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1617342.0, ans=0.0 2023-06-26 17:23:12,920 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1617402.0, ans=0.2 2023-06-26 17:23:29,866 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.761e+02 5.184e+02 7.757e+02 1.041e+03 2.426e+03, threshold=1.551e+03, percent-clipped=8.0 2023-06-26 17:23:39,014 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff3.min_abs, batch_count=1617462.0, ans=0.2 2023-06-26 17:24:36,585 INFO [train.py:996] (0/4) Epoch 9, batch 25650, loss[loss=0.2071, simple_loss=0.276, pruned_loss=0.06908, over 21275.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2927, pruned_loss=0.06685, over 4256172.75 frames. ], batch size: 144, lr: 3.22e-03, grad_scale: 32.0 2023-06-26 17:24:37,193 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1617642.0, ans=0.125 2023-06-26 17:25:13,638 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1617702.0, ans=0.2 2023-06-26 17:25:24,177 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1617762.0, ans=0.035 2023-06-26 17:25:36,077 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1617762.0, ans=0.1 2023-06-26 17:25:45,753 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.14 vs. limit=15.0 2023-06-26 17:26:06,459 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.48 vs. limit=22.5 2023-06-26 17:26:12,643 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1617882.0, ans=0.0 2023-06-26 17:26:24,647 INFO [train.py:996] (0/4) Epoch 9, batch 25700, loss[loss=0.2357, simple_loss=0.312, pruned_loss=0.07974, over 21440.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2908, pruned_loss=0.06763, over 4252157.20 frames. 
], batch size: 131, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:26:34,681 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1617942.0, ans=0.0 2023-06-26 17:26:41,562 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1617942.0, ans=0.125 2023-06-26 17:26:59,417 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1618002.0, ans=0.125 2023-06-26 17:27:02,623 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1618002.0, ans=0.2 2023-06-26 17:27:07,785 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1618002.0, ans=0.125 2023-06-26 17:27:07,845 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1618002.0, ans=0.2 2023-06-26 17:27:08,745 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.793e+02 5.331e+02 7.573e+02 1.078e+03 3.200e+03, threshold=1.515e+03, percent-clipped=12.0 2023-06-26 17:27:13,949 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.85 vs. limit=15.0 2023-06-26 17:27:18,671 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1618062.0, ans=0.125 2023-06-26 17:27:49,152 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1618122.0, ans=0.125 2023-06-26 17:28:19,549 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.87 vs. limit=15.0 2023-06-26 17:28:21,571 INFO [train.py:996] (0/4) Epoch 9, batch 25750, loss[loss=0.2157, simple_loss=0.2806, pruned_loss=0.07544, over 20018.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.296, pruned_loss=0.07023, over 4257684.13 frames. ], batch size: 702, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:28:25,720 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1618242.0, ans=0.125 2023-06-26 17:28:47,589 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1618302.0, ans=0.125 2023-06-26 17:28:49,155 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1618302.0, ans=0.0 2023-06-26 17:30:08,940 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1618482.0, ans=0.0 2023-06-26 17:30:18,714 INFO [train.py:996] (0/4) Epoch 9, batch 25800, loss[loss=0.284, simple_loss=0.3555, pruned_loss=0.1062, over 21780.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3073, pruned_loss=0.07452, over 4262346.54 frames. 
], batch size: 441, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:31:03,957 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.718e+02 5.908e+02 7.803e+02 1.133e+03 2.789e+03, threshold=1.561e+03, percent-clipped=14.0 2023-06-26 17:31:04,490 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=1618662.0, ans=0.5 2023-06-26 17:32:08,631 INFO [train.py:996] (0/4) Epoch 9, batch 25850, loss[loss=0.2388, simple_loss=0.3028, pruned_loss=0.08736, over 21343.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3091, pruned_loss=0.07484, over 4263708.76 frames. ], batch size: 143, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:32:27,418 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1618842.0, ans=0.2 2023-06-26 17:32:51,628 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1618902.0, ans=0.125 2023-06-26 17:33:16,441 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1619022.0, ans=0.0 2023-06-26 17:33:29,461 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.12 vs. limit=15.0 2023-06-26 17:33:39,896 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1619082.0, ans=0.0 2023-06-26 17:33:39,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1619082.0, ans=0.125 2023-06-26 17:33:48,476 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1619082.0, ans=0.0 2023-06-26 17:34:03,382 INFO [train.py:996] (0/4) Epoch 9, batch 25900, loss[loss=0.2592, simple_loss=0.3566, pruned_loss=0.08087, over 21698.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3094, pruned_loss=0.07544, over 4270305.72 frames. ], batch size: 298, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:34:15,179 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.28 vs. limit=15.0 2023-06-26 17:34:30,296 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1619202.0, ans=0.2 2023-06-26 17:34:47,607 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.581e+02 5.400e+02 8.685e+02 1.109e+03 2.488e+03, threshold=1.737e+03, percent-clipped=11.0 2023-06-26 17:35:06,585 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.41 vs. limit=12.0 2023-06-26 17:35:26,594 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1619322.0, ans=10.0 2023-06-26 17:35:58,963 INFO [train.py:996] (0/4) Epoch 9, batch 25950, loss[loss=0.2471, simple_loss=0.3263, pruned_loss=0.08392, over 21929.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3154, pruned_loss=0.07788, over 4273560.00 frames. ], batch size: 372, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:36:15,375 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.59 vs. 
limit=15.0 2023-06-26 17:36:44,149 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1619562.0, ans=0.125 2023-06-26 17:36:50,704 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=1619562.0, ans=0.05 2023-06-26 17:37:02,959 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1619622.0, ans=0.125 2023-06-26 17:37:10,016 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1619622.0, ans=0.0 2023-06-26 17:37:22,678 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1619622.0, ans=0.0 2023-06-26 17:37:49,221 INFO [train.py:996] (0/4) Epoch 9, batch 26000, loss[loss=0.242, simple_loss=0.3252, pruned_loss=0.0794, over 21964.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3139, pruned_loss=0.07566, over 4274329.62 frames. ], batch size: 317, lr: 3.22e-03, grad_scale: 32.0 2023-06-26 17:38:23,550 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1619802.0, ans=0.125 2023-06-26 17:38:33,552 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.565e+02 5.045e+02 5.850e+02 7.861e+02 1.944e+03, threshold=1.170e+03, percent-clipped=2.0 2023-06-26 17:39:37,993 INFO [train.py:996] (0/4) Epoch 9, batch 26050, loss[loss=0.2869, simple_loss=0.3307, pruned_loss=0.1215, over 21726.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3147, pruned_loss=0.07752, over 4271601.71 frames. ], batch size: 508, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:39:43,426 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 17:40:03,064 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1620102.0, ans=0.125 2023-06-26 17:40:18,765 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1620102.0, ans=0.1 2023-06-26 17:40:39,592 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1620222.0, ans=0.0 2023-06-26 17:41:04,163 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1620282.0, ans=0.2 2023-06-26 17:41:21,096 INFO [train.py:996] (0/4) Epoch 9, batch 26100, loss[loss=0.2246, simple_loss=0.2923, pruned_loss=0.07839, over 21890.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3091, pruned_loss=0.07663, over 4279308.48 frames. ], batch size: 371, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:41:55,083 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1620402.0, ans=0.125 2023-06-26 17:41:59,134 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.83 vs. limit=15.0 2023-06-26 17:42:06,206 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.31 vs. 
limit=15.0 2023-06-26 17:42:06,467 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.994e+02 5.585e+02 7.440e+02 1.140e+03 2.110e+03, threshold=1.488e+03, percent-clipped=23.0 2023-06-26 17:42:30,323 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1620522.0, ans=0.125 2023-06-26 17:42:30,845 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.50 vs. limit=15.0 2023-06-26 17:42:46,111 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1620582.0, ans=0.0 2023-06-26 17:43:04,899 INFO [train.py:996] (0/4) Epoch 9, batch 26150, loss[loss=0.2481, simple_loss=0.3359, pruned_loss=0.08022, over 21831.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3066, pruned_loss=0.0765, over 4282185.38 frames. ], batch size: 124, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:43:41,748 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1620702.0, ans=0.125 2023-06-26 17:44:06,678 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1620762.0, ans=0.0 2023-06-26 17:45:00,372 INFO [train.py:996] (0/4) Epoch 9, batch 26200, loss[loss=0.2154, simple_loss=0.2707, pruned_loss=0.08005, over 20064.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.3062, pruned_loss=0.07406, over 4286009.54 frames. ], batch size: 702, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:45:15,496 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1620942.0, ans=0.1 2023-06-26 17:45:41,654 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.840e+02 5.161e+02 8.097e+02 1.241e+03 2.329e+03, threshold=1.619e+03, percent-clipped=17.0 2023-06-26 17:46:55,704 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1621242.0, ans=0.125 2023-06-26 17:46:56,622 INFO [train.py:996] (0/4) Epoch 9, batch 26250, loss[loss=0.2214, simple_loss=0.2945, pruned_loss=0.07412, over 21516.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3089, pruned_loss=0.07255, over 4278771.07 frames. ], batch size: 194, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:47:40,318 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1621362.0, ans=0.125 2023-06-26 17:47:55,981 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1621362.0, ans=0.05 2023-06-26 17:48:15,346 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1621482.0, ans=0.1 2023-06-26 17:48:44,885 INFO [train.py:996] (0/4) Epoch 9, batch 26300, loss[loss=0.1952, simple_loss=0.2715, pruned_loss=0.05939, over 21680.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.3053, pruned_loss=0.07276, over 4285726.81 frames. ], batch size: 263, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:49:25,499 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.773e+02 5.088e+02 7.206e+02 1.171e+03 1.823e+03, threshold=1.441e+03, percent-clipped=7.0 2023-06-26 17:50:23,875 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.16 vs. 
limit=6.0 2023-06-26 17:50:34,488 INFO [train.py:996] (0/4) Epoch 9, batch 26350, loss[loss=0.2607, simple_loss=0.3325, pruned_loss=0.09441, over 21479.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.3037, pruned_loss=0.07357, over 4289130.96 frames. ], batch size: 194, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:50:38,889 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1621842.0, ans=0.125 2023-06-26 17:50:41,031 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1621842.0, ans=0.125 2023-06-26 17:51:17,199 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1621902.0, ans=0.0 2023-06-26 17:52:14,986 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.84 vs. limit=5.0 2023-06-26 17:52:23,822 INFO [train.py:996] (0/4) Epoch 9, batch 26400, loss[loss=0.194, simple_loss=0.2543, pruned_loss=0.06687, over 21610.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.2989, pruned_loss=0.07409, over 4274492.74 frames. ], batch size: 231, lr: 3.22e-03, grad_scale: 32.0 2023-06-26 17:52:24,420 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1622142.0, ans=0.0 2023-06-26 17:53:12,759 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.822e+02 5.037e+02 6.959e+02 9.647e+02 1.675e+03, threshold=1.392e+03, percent-clipped=4.0 2023-06-26 17:53:17,234 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1622262.0, ans=0.125 2023-06-26 17:53:44,428 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1622322.0, ans=0.125 2023-06-26 17:53:48,404 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1622322.0, ans=0.0 2023-06-26 17:53:59,428 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1622382.0, ans=0.07 2023-06-26 17:54:05,498 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.97 vs. limit=22.5 2023-06-26 17:54:16,760 INFO [train.py:996] (0/4) Epoch 9, batch 26450, loss[loss=0.2442, simple_loss=0.3294, pruned_loss=0.07952, over 21585.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.2992, pruned_loss=0.0739, over 4274211.96 frames. ], batch size: 230, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:54:37,798 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.82 vs. limit=10.0 2023-06-26 17:56:06,599 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1622742.0, ans=0.1 2023-06-26 17:56:13,610 INFO [train.py:996] (0/4) Epoch 9, batch 26500, loss[loss=0.2014, simple_loss=0.278, pruned_loss=0.06245, over 21634.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.301, pruned_loss=0.07269, over 4265827.15 frames. 
], batch size: 263, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:57:02,411 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1622802.0, ans=0.125 2023-06-26 17:57:07,193 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.802e+02 5.662e+02 1.052e+03 1.637e+03 4.186e+03, threshold=2.103e+03, percent-clipped=36.0 2023-06-26 17:57:53,749 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1622982.0, ans=0.2 2023-06-26 17:58:06,120 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1622982.0, ans=0.125 2023-06-26 17:58:11,208 INFO [train.py:996] (0/4) Epoch 9, batch 26550, loss[loss=0.1912, simple_loss=0.3004, pruned_loss=0.04096, over 20790.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2976, pruned_loss=0.06984, over 4270134.61 frames. ], batch size: 608, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:58:23,822 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1623042.0, ans=0.125 2023-06-26 17:58:34,592 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1623042.0, ans=0.2 2023-06-26 17:58:47,919 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.75 vs. limit=15.0 2023-06-26 17:59:35,586 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1623222.0, ans=0.125 2023-06-26 18:00:01,005 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1623282.0, ans=0.1 2023-06-26 18:00:05,343 INFO [train.py:996] (0/4) Epoch 9, batch 26600, loss[loss=0.2286, simple_loss=0.2895, pruned_loss=0.08386, over 20152.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2964, pruned_loss=0.06711, over 4273464.80 frames. ], batch size: 702, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 18:00:41,752 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1623402.0, ans=0.07 2023-06-26 18:00:46,540 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1623462.0, ans=0.0 2023-06-26 18:00:47,581 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.393e+02 5.073e+02 7.169e+02 1.139e+03 3.123e+03, threshold=1.434e+03, percent-clipped=9.0 2023-06-26 18:00:48,811 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.33 vs. limit=15.0 2023-06-26 18:01:11,289 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1623462.0, ans=0.2 2023-06-26 18:01:59,711 INFO [train.py:996] (0/4) Epoch 9, batch 26650, loss[loss=0.1724, simple_loss=0.2584, pruned_loss=0.04317, over 21871.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2895, pruned_loss=0.06584, over 4272527.85 frames. 
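The tot_loss values in these lines are not single-batch numbers: they are running averages weighted by how many frames each batch contributes (the "over N frames" part). The sketch below is a minimal illustration of such a frame-weighted tracker, written as an assumption about what the logging means rather than a copy of the recipe's own tracking code.

    # Hypothetical frame-weighted running loss average, mirroring the
    # "tot_loss[... over N frames]" entries above. Illustration only.
    class RunningLoss:
        def __init__(self) -> None:
            self.loss_sum = 0.0   # loss summed over all frames seen so far
            self.frames = 0.0     # total number of frames seen so far

        def update(self, batch_loss_sum: float, batch_frames: float) -> None:
            # batch_loss_sum is the loss summed (not averaged) over the batch
            self.loss_sum += batch_loss_sum
            self.frames += batch_frames

        @property
        def average(self) -> float:
            return self.loss_sum / max(self.frames, 1.0)

    tracker = RunningLoss()
    # numbers echo one per-batch line above: loss=0.2388 over 21343 frames
    tracker.update(batch_loss_sum=0.2388 * 21343.0, batch_frames=21343.0)
    print(f"tot_loss={tracker.average:.4f} over {tracker.frames:.2f} frames")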
], batch size: 373, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 18:02:37,046 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1623762.0, ans=0.125 2023-06-26 18:03:03,393 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1623822.0, ans=0.125 2023-06-26 18:03:04,814 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1623822.0, ans=0.0 2023-06-26 18:03:06,584 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1623822.0, ans=0.1 2023-06-26 18:03:40,947 INFO [train.py:996] (0/4) Epoch 9, batch 26700, loss[loss=0.24, simple_loss=0.2962, pruned_loss=0.09195, over 21776.00 frames. ], tot_loss[loss=0.2038, simple_loss=0.2819, pruned_loss=0.06281, over 4256589.96 frames. ], batch size: 508, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 18:03:51,331 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.84 vs. limit=22.5 2023-06-26 18:04:29,909 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.831e+02 4.080e+02 5.616e+02 9.381e+02 2.662e+03, threshold=1.123e+03, percent-clipped=11.0 2023-06-26 18:05:00,047 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1624122.0, ans=0.125 2023-06-26 18:05:29,859 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1624182.0, ans=0.0 2023-06-26 18:05:36,287 INFO [train.py:996] (0/4) Epoch 9, batch 26750, loss[loss=0.2012, simple_loss=0.2963, pruned_loss=0.05305, over 20768.00 frames. ], tot_loss[loss=0.2045, simple_loss=0.2834, pruned_loss=0.0628, over 4262131.71 frames. ], batch size: 607, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:05:37,084 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1624242.0, ans=0.0 2023-06-26 18:05:53,439 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.73 vs. limit=22.5 2023-06-26 18:06:36,104 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1624362.0, ans=0.0 2023-06-26 18:07:00,996 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1624422.0, ans=0.125 2023-06-26 18:07:18,936 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1624482.0, ans=0.125 2023-06-26 18:07:27,091 INFO [train.py:996] (0/4) Epoch 9, batch 26800, loss[loss=0.2533, simple_loss=0.331, pruned_loss=0.08781, over 21365.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2909, pruned_loss=0.06691, over 4272576.75 frames. ], batch size: 159, lr: 3.21e-03, grad_scale: 32.0 2023-06-26 18:08:15,094 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.607e+02 5.810e+02 7.473e+02 1.088e+03 2.811e+03, threshold=1.495e+03, percent-clipped=19.0 2023-06-26 18:08:26,433 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.06 vs. 
limit=22.5 2023-06-26 18:09:10,121 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1624782.0, ans=0.0 2023-06-26 18:09:22,012 INFO [train.py:996] (0/4) Epoch 9, batch 26850, loss[loss=0.2208, simple_loss=0.2789, pruned_loss=0.08131, over 21289.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2938, pruned_loss=0.06963, over 4272151.53 frames. ], batch size: 159, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:10:43,810 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1625022.0, ans=0.125 2023-06-26 18:10:43,817 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1625022.0, ans=0.0 2023-06-26 18:11:09,551 INFO [train.py:996] (0/4) Epoch 9, batch 26900, loss[loss=0.2, simple_loss=0.2572, pruned_loss=0.07139, over 21508.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2857, pruned_loss=0.06874, over 4274360.78 frames. ], batch size: 442, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:11:21,888 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1625142.0, ans=0.2 2023-06-26 18:11:52,557 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.498e+02 4.462e+02 5.999e+02 9.238e+02 1.607e+03, threshold=1.200e+03, percent-clipped=3.0 2023-06-26 18:12:28,355 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1625322.0, ans=0.125 2023-06-26 18:12:48,252 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1625382.0, ans=0.2 2023-06-26 18:12:58,010 INFO [train.py:996] (0/4) Epoch 9, batch 26950, loss[loss=0.2587, simple_loss=0.3435, pruned_loss=0.08696, over 21568.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2863, pruned_loss=0.06907, over 4272752.56 frames. ], batch size: 441, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:13:32,217 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1625502.0, ans=0.2 2023-06-26 18:13:36,035 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.78 vs. limit=15.0 2023-06-26 18:14:05,524 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1625622.0, ans=0.125 2023-06-26 18:14:16,203 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1625622.0, ans=0.0 2023-06-26 18:14:47,865 INFO [train.py:996] (0/4) Epoch 9, batch 27000, loss[loss=0.1835, simple_loss=0.2671, pruned_loss=0.04998, over 21599.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2855, pruned_loss=0.0662, over 4269399.98 frames. ], batch size: 263, lr: 3.21e-03, grad_scale: 8.0 2023-06-26 18:14:47,866 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-26 18:15:07,480 INFO [train.py:1028] (0/4) Epoch 9, validation: loss=0.2501, simple_loss=0.3419, pruned_loss=0.07919, over 1796401.00 frames. 2023-06-26 18:15:07,481 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23714MB 2023-06-26 18:15:19,915 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.23 vs. 
limit=15.0 2023-06-26 18:15:31,887 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1625802.0, ans=0.125 2023-06-26 18:15:58,916 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=1625862.0, ans=0.5 2023-06-26 18:15:59,910 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.455e+02 5.551e+02 8.937e+02 1.384e+03 3.879e+03, threshold=1.787e+03, percent-clipped=32.0 2023-06-26 18:16:02,086 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1625862.0, ans=0.035 2023-06-26 18:16:35,139 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.14 vs. limit=12.0 2023-06-26 18:16:57,943 INFO [train.py:996] (0/4) Epoch 9, batch 27050, loss[loss=0.2159, simple_loss=0.3038, pruned_loss=0.06403, over 21636.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.288, pruned_loss=0.0638, over 4274963.08 frames. ], batch size: 263, lr: 3.21e-03, grad_scale: 8.0 2023-06-26 18:17:01,856 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1626042.0, ans=0.125 2023-06-26 18:17:54,221 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 18:17:54,276 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1626162.0, ans=0.2 2023-06-26 18:18:21,231 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1626222.0, ans=0.125 2023-06-26 18:18:30,601 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1626282.0, ans=0.125 2023-06-26 18:18:31,100 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=1626282.0, ans=15.0 2023-06-26 18:18:49,372 INFO [train.py:996] (0/4) Epoch 9, batch 27100, loss[loss=0.2227, simple_loss=0.3116, pruned_loss=0.06687, over 21467.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.29, pruned_loss=0.06458, over 4281654.55 frames. ], batch size: 548, lr: 3.21e-03, grad_scale: 8.0 2023-06-26 18:19:35,899 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.82 vs. limit=15.0 2023-06-26 18:19:37,035 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1626462.0, ans=0.2 2023-06-26 18:19:40,204 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.50 vs. limit=15.0 2023-06-26 18:19:41,575 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=1626462.0, ans=22.5 2023-06-26 18:19:42,256 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.471e+02 6.179e+02 8.599e+02 1.265e+03 2.717e+03, threshold=1.720e+03, percent-clipped=9.0 2023-06-26 18:20:46,707 INFO [train.py:996] (0/4) Epoch 9, batch 27150, loss[loss=0.253, simple_loss=0.3432, pruned_loss=0.0814, over 21855.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.3017, pruned_loss=0.06782, over 4274799.56 frames. 
], batch size: 316, lr: 3.21e-03, grad_scale: 8.0 2023-06-26 18:21:19,327 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1626702.0, ans=10.0 2023-06-26 18:21:25,291 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.65 vs. limit=5.0 2023-06-26 18:21:36,897 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1626762.0, ans=0.0 2023-06-26 18:21:52,863 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.87 vs. limit=15.0 2023-06-26 18:21:54,711 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.19 vs. limit=12.0 2023-06-26 18:22:35,005 INFO [train.py:996] (0/4) Epoch 9, batch 27200, loss[loss=0.2359, simple_loss=0.3186, pruned_loss=0.07664, over 21730.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.3093, pruned_loss=0.06994, over 4275489.11 frames. ], batch size: 298, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:23:25,801 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.290e+02 5.594e+02 8.054e+02 1.283e+03 2.318e+03, threshold=1.611e+03, percent-clipped=7.0 2023-06-26 18:23:58,401 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1627122.0, ans=0.0 2023-06-26 18:24:30,169 INFO [train.py:996] (0/4) Epoch 9, batch 27250, loss[loss=0.2536, simple_loss=0.3282, pruned_loss=0.08948, over 21801.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.311, pruned_loss=0.07304, over 4273262.10 frames. ], batch size: 441, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:24:46,928 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1627302.0, ans=0.125 2023-06-26 18:25:47,273 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1627422.0, ans=0.2 2023-06-26 18:26:14,770 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.83 vs. limit=15.0 2023-06-26 18:26:20,987 INFO [train.py:996] (0/4) Epoch 9, batch 27300, loss[loss=0.1679, simple_loss=0.2406, pruned_loss=0.04764, over 16626.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3122, pruned_loss=0.0743, over 4269895.37 frames. ], batch size: 60, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:27:18,627 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.468e+02 5.640e+02 6.768e+02 9.000e+02 1.859e+03, threshold=1.354e+03, percent-clipped=2.0 2023-06-26 18:27:37,385 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.72 vs. limit=15.0 2023-06-26 18:28:04,752 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 18:28:17,663 INFO [train.py:996] (0/4) Epoch 9, batch 27350, loss[loss=0.241, simple_loss=0.3621, pruned_loss=0.05997, over 19873.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3159, pruned_loss=0.07513, over 4259723.36 frames. 
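Each "Clipping_scale=2.0, grad-norm quartiles ... threshold=..., percent-clipped=..." line summarises the distribution of recent gradient norms and the clipping threshold derived from it. The sketch below is a hypothetical reconstruction of how such a report could be produced from a window of per-batch norms; the actual optim.py rule for setting the threshold may differ.

    import torch

    def grad_norm_report(recent_norms: torch.Tensor, scale: float = 2.0):
        """Summarise a window of recent per-batch gradient norms.

        The threshold rule (scale x median) is an assumption made for this
        illustration, not the real optimizer's logic.
        """
        qs = torch.quantile(
            recent_norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0])
        )
        threshold = scale * qs[2]                        # e.g. 2x the median norm
        clipped = (recent_norms > threshold).float().mean() * 100.0
        return qs, threshold, clipped.item()

    norms = torch.tensor([310.0, 480.0, 640.0, 900.0, 2100.0])  # made-up norms
    quartiles, thr, pct = grad_norm_report(norms)
    print("quartiles", quartiles.tolist(),
          "threshold", thr.item(), "percent-clipped", pct)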
], batch size: 702, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:28:47,625 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.69 vs. limit=10.0 2023-06-26 18:28:58,981 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1627962.0, ans=0.1 2023-06-26 18:29:35,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1628022.0, ans=0.125 2023-06-26 18:30:04,059 INFO [train.py:996] (0/4) Epoch 9, batch 27400, loss[loss=0.2092, simple_loss=0.2772, pruned_loss=0.07059, over 21803.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3107, pruned_loss=0.07429, over 4264706.79 frames. ], batch size: 371, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:30:08,404 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1628142.0, ans=0.125 2023-06-26 18:30:43,474 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1628262.0, ans=0.0 2023-06-26 18:30:54,135 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.725e+02 5.126e+02 6.894e+02 1.011e+03 2.169e+03, threshold=1.379e+03, percent-clipped=11.0 2023-06-26 18:31:52,323 INFO [train.py:996] (0/4) Epoch 9, batch 27450, loss[loss=0.2318, simple_loss=0.3111, pruned_loss=0.07623, over 21880.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.3044, pruned_loss=0.07264, over 4275239.11 frames. ], batch size: 316, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:32:32,244 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1628502.0, ans=0.0 2023-06-26 18:32:56,278 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1628562.0, ans=0.2 2023-06-26 18:33:08,353 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1628622.0, ans=0.125 2023-06-26 18:33:38,618 INFO [train.py:996] (0/4) Epoch 9, batch 27500, loss[loss=0.2087, simple_loss=0.2812, pruned_loss=0.06807, over 21531.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.3023, pruned_loss=0.07279, over 4275847.58 frames. ], batch size: 194, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:33:57,100 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1628742.0, ans=0.0 2023-06-26 18:34:11,119 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1628802.0, ans=0.2 2023-06-26 18:34:25,544 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.12 vs. limit=15.0 2023-06-26 18:34:29,852 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.928e+02 5.202e+02 7.866e+02 1.174e+03 2.313e+03, threshold=1.573e+03, percent-clipped=14.0 2023-06-26 18:34:43,215 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.83 vs. 
limit=15.0 2023-06-26 18:34:54,945 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1628922.0, ans=0.0 2023-06-26 18:34:55,680 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.55 vs. limit=15.0 2023-06-26 18:35:17,329 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1628982.0, ans=0.2 2023-06-26 18:35:27,115 INFO [train.py:996] (0/4) Epoch 9, batch 27550, loss[loss=0.2476, simple_loss=0.3036, pruned_loss=0.09581, over 21412.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2966, pruned_loss=0.06962, over 4273478.62 frames. ], batch size: 507, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:36:59,706 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1629282.0, ans=0.0 2023-06-26 18:37:07,270 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=1629282.0, ans=15.0 2023-06-26 18:37:21,076 INFO [train.py:996] (0/4) Epoch 9, batch 27600, loss[loss=0.1929, simple_loss=0.2645, pruned_loss=0.06059, over 21839.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2918, pruned_loss=0.06896, over 4270337.76 frames. ], batch size: 107, lr: 3.21e-03, grad_scale: 32.0 2023-06-26 18:38:11,879 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.490e+02 6.372e+02 8.382e+02 1.316e+03 3.069e+03, threshold=1.676e+03, percent-clipped=15.0 2023-06-26 18:39:08,016 INFO [train.py:996] (0/4) Epoch 9, batch 27650, loss[loss=0.2035, simple_loss=0.2773, pruned_loss=0.06485, over 21817.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2863, pruned_loss=0.06827, over 4260124.14 frames. ], batch size: 351, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:39:09,485 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.16 vs. limit=22.5 2023-06-26 18:39:15,325 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1629642.0, ans=0.125 2023-06-26 18:39:16,773 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1629642.0, ans=0.125 2023-06-26 18:39:33,021 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1629702.0, ans=0.0 2023-06-26 18:39:35,967 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.19 vs. limit=22.5 2023-06-26 18:39:36,631 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1629702.0, ans=0.0 2023-06-26 18:39:59,273 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1629762.0, ans=0.07 2023-06-26 18:40:55,822 INFO [train.py:996] (0/4) Epoch 9, batch 27700, loss[loss=0.1927, simple_loss=0.2791, pruned_loss=0.05311, over 21611.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2875, pruned_loss=0.06718, over 4261342.07 frames. 
], batch size: 230, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:40:56,439 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1629942.0, ans=0.0 2023-06-26 18:40:58,363 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1629942.0, ans=0.1 2023-06-26 18:41:10,777 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1629942.0, ans=0.1 2023-06-26 18:41:26,613 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.83 vs. limit=12.0 2023-06-26 18:41:27,760 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1630002.0, ans=0.125 2023-06-26 18:41:31,166 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1630002.0, ans=0.2 2023-06-26 18:41:47,712 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.419e+02 4.763e+02 6.253e+02 8.900e+02 1.966e+03, threshold=1.251e+03, percent-clipped=3.0 2023-06-26 18:42:37,185 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1630182.0, ans=0.1 2023-06-26 18:42:43,155 INFO [train.py:996] (0/4) Epoch 9, batch 27750, loss[loss=0.1797, simple_loss=0.2647, pruned_loss=0.04738, over 21161.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.2898, pruned_loss=0.06695, over 4260993.03 frames. ], batch size: 159, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:42:52,549 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1630242.0, ans=0.025 2023-06-26 18:43:10,220 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.01 vs. limit=15.0 2023-06-26 18:43:27,081 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1630362.0, ans=0.125 2023-06-26 18:44:23,859 INFO [train.py:996] (0/4) Epoch 9, batch 27800, loss[loss=0.1899, simple_loss=0.2619, pruned_loss=0.05894, over 21586.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2879, pruned_loss=0.06644, over 4273610.33 frames. ], batch size: 195, lr: 3.21e-03, grad_scale: 8.0 2023-06-26 18:44:43,153 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1630542.0, ans=0.125 2023-06-26 18:45:21,858 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1630662.0, ans=0.0 2023-06-26 18:45:23,036 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.595e+02 5.099e+02 6.470e+02 1.005e+03 1.791e+03, threshold=1.294e+03, percent-clipped=14.0 2023-06-26 18:46:18,805 INFO [train.py:996] (0/4) Epoch 9, batch 27850, loss[loss=0.2164, simple_loss=0.2891, pruned_loss=0.07182, over 21870.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.2873, pruned_loss=0.06814, over 4285002.31 frames. 
], batch size: 371, lr: 3.21e-03, grad_scale: 8.0 2023-06-26 18:46:44,859 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1630902.0, ans=0.0 2023-06-26 18:47:22,893 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1631022.0, ans=0.125 2023-06-26 18:47:34,688 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1631022.0, ans=0.0 2023-06-26 18:47:34,768 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1631022.0, ans=0.125 2023-06-26 18:47:39,727 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1631022.0, ans=0.0 2023-06-26 18:48:11,026 INFO [train.py:996] (0/4) Epoch 9, batch 27900, loss[loss=0.2156, simple_loss=0.3056, pruned_loss=0.0628, over 21644.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2968, pruned_loss=0.06979, over 4280751.46 frames. ], batch size: 263, lr: 3.21e-03, grad_scale: 8.0 2023-06-26 18:48:33,502 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_na.min_abs, batch_count=1631142.0, ans=0.02 2023-06-26 18:48:53,068 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1631202.0, ans=0.09899494936611666 2023-06-26 18:49:12,750 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.641e+02 5.533e+02 7.337e+02 1.067e+03 2.093e+03, threshold=1.467e+03, percent-clipped=13.0 2023-06-26 18:49:56,897 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.76 vs. limit=12.0 2023-06-26 18:50:03,396 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.89 vs. limit=12.0 2023-06-26 18:50:09,133 INFO [train.py:996] (0/4) Epoch 9, batch 27950, loss[loss=0.2117, simple_loss=0.3104, pruned_loss=0.05652, over 21737.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2968, pruned_loss=0.06693, over 4283493.00 frames. ], batch size: 351, lr: 3.21e-03, grad_scale: 8.0 2023-06-26 18:50:42,544 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1631502.0, ans=0.125 2023-06-26 18:51:30,381 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1631622.0, ans=0.125 2023-06-26 18:51:32,492 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1631622.0, ans=0.125 2023-06-26 18:51:52,477 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1631682.0, ans=0.2 2023-06-26 18:51:53,211 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.32 vs. limit=15.0 2023-06-26 18:51:55,813 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1631682.0, ans=0.125 2023-06-26 18:51:58,505 INFO [train.py:996] (0/4) Epoch 9, batch 28000, loss[loss=0.1857, simple_loss=0.2677, pruned_loss=0.05186, over 21407.00 frames. 
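The ScheduledFloat lines record hyperparameters (dropout probabilities, skip rates, balancer bounds) whose current value ("ans") is a function of the global batch_count. A simplified, assumed piecewise-linear schedule is sketched below to illustrate the idea; the breakpoints and values are made up and the real scaling.py class has more machinery.

    # Hypothetical piecewise-linear schedule keyed on batch_count.
    class ScheduledValue:
        def __init__(self, *points):
            # points: (batch_count, value) pairs in increasing batch_count order
            self.points = list(points)

        def value(self, batch_count: float) -> float:
            pts = self.points
            if batch_count <= pts[0][0]:
                return pts[0][1]
            if batch_count >= pts[-1][0]:
                return pts[-1][1]
            for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
                if x0 <= batch_count <= x1:
                    frac = (batch_count - x0) / (x1 - x0)
                    return y0 + frac * (y1 - y0)

    dropout_p = ScheduledValue((0.0, 0.3), (20000.0, 0.1))
    # far past the last breakpoint, so the schedule has settled at 0.1
    print(dropout_p.value(1630002.0))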
], tot_loss[loss=0.213, simple_loss=0.2949, pruned_loss=0.06555, over 4283161.35 frames. ], batch size: 194, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:52:08,342 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 18:52:12,170 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.16 vs. limit=12.0 2023-06-26 18:52:49,979 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.10 vs. limit=15.0 2023-06-26 18:52:53,931 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.093e+02 5.535e+02 9.213e+02 1.280e+03 3.629e+03, threshold=1.843e+03, percent-clipped=20.0 2023-06-26 18:53:01,973 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1631922.0, ans=0.04949747468305833 2023-06-26 18:53:07,188 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1631922.0, ans=0.125 2023-06-26 18:53:30,867 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-272000.pt 2023-06-26 18:53:56,230 INFO [train.py:996] (0/4) Epoch 9, batch 28050, loss[loss=0.1915, simple_loss=0.2748, pruned_loss=0.0541, over 21858.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2942, pruned_loss=0.06737, over 4285546.57 frames. ], batch size: 316, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:54:46,322 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1632162.0, ans=0.0 2023-06-26 18:54:56,914 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.92 vs. limit=15.0 2023-06-26 18:54:56,954 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.93 vs. limit=6.0 2023-06-26 18:55:17,688 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.68 vs. limit=15.0 2023-06-26 18:55:44,949 INFO [train.py:996] (0/4) Epoch 9, batch 28100, loss[loss=0.1858, simple_loss=0.2461, pruned_loss=0.06278, over 21559.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2912, pruned_loss=0.06709, over 4285092.43 frames. ], batch size: 195, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:55:57,131 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1632342.0, ans=0.125 2023-06-26 18:56:13,712 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1632402.0, ans=0.2 2023-06-26 18:56:15,493 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=1632402.0, ans=0.025 2023-06-26 18:56:21,010 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.72 vs. 
limit=22.5 2023-06-26 18:56:37,149 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.724e+02 5.263e+02 6.694e+02 1.046e+03 2.729e+03, threshold=1.339e+03, percent-clipped=5.0 2023-06-26 18:56:45,292 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1632522.0, ans=0.0 2023-06-26 18:57:22,143 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1632582.0, ans=0.125 2023-06-26 18:57:23,502 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1632582.0, ans=0.0 2023-06-26 18:57:25,133 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1632582.0, ans=0.125 2023-06-26 18:57:29,625 INFO [train.py:996] (0/4) Epoch 9, batch 28150, loss[loss=0.1847, simple_loss=0.2543, pruned_loss=0.05758, over 21440.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.285, pruned_loss=0.06701, over 4283876.49 frames. ], batch size: 389, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:57:59,003 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1632702.0, ans=0.0 2023-06-26 18:58:00,772 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1632702.0, ans=0.125 2023-06-26 18:58:02,447 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1632702.0, ans=0.125 2023-06-26 18:59:08,822 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1632882.0, ans=0.125 2023-06-26 18:59:18,533 INFO [train.py:996] (0/4) Epoch 9, batch 28200, loss[loss=0.2581, simple_loss=0.3222, pruned_loss=0.09701, over 21571.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2846, pruned_loss=0.06771, over 4280877.71 frames. ], batch size: 414, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:59:21,304 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1632942.0, ans=0.125 2023-06-26 18:59:27,789 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1632942.0, ans=0.125 2023-06-26 18:59:52,735 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1633002.0, ans=0.1 2023-06-26 19:00:13,419 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.542e+02 6.148e+02 9.394e+02 1.401e+03 3.381e+03, threshold=1.879e+03, percent-clipped=30.0 2023-06-26 19:00:27,123 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1633122.0, ans=0.125 2023-06-26 19:00:28,989 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1633122.0, ans=0.2 2023-06-26 19:01:07,263 INFO [train.py:996] (0/4) Epoch 9, batch 28250, loss[loss=0.1712, simple_loss=0.2197, pruned_loss=0.06135, over 20859.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2866, pruned_loss=0.07022, over 4277756.20 frames. 
], batch size: 609, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 19:01:48,732 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1633302.0, ans=0.125 2023-06-26 19:02:17,753 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.30 vs. limit=22.5 2023-06-26 19:02:29,488 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.02 vs. limit=6.0 2023-06-26 19:02:49,878 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1633482.0, ans=0.0 2023-06-26 19:03:02,953 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1633542.0, ans=0.125 2023-06-26 19:03:03,816 INFO [train.py:996] (0/4) Epoch 9, batch 28300, loss[loss=0.1776, simple_loss=0.2786, pruned_loss=0.0383, over 21718.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2846, pruned_loss=0.06838, over 4273348.05 frames. ], batch size: 332, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 19:03:11,646 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1633542.0, ans=0.0 2023-06-26 19:03:45,500 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1633602.0, ans=10.0 2023-06-26 19:03:48,838 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1633662.0, ans=0.0 2023-06-26 19:03:52,748 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1633662.0, ans=0.0 2023-06-26 19:03:55,041 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.46 vs. limit=15.0 2023-06-26 19:03:58,617 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.505e+02 4.596e+02 7.876e+02 1.186e+03 2.671e+03, threshold=1.575e+03, percent-clipped=4.0 2023-06-26 19:04:07,507 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.45 vs. limit=22.5 2023-06-26 19:04:53,325 INFO [train.py:996] (0/4) Epoch 9, batch 28350, loss[loss=0.2427, simple_loss=0.327, pruned_loss=0.07923, over 21413.00 frames. ], tot_loss[loss=0.2042, simple_loss=0.2816, pruned_loss=0.06343, over 4275221.53 frames. ], batch size: 507, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 19:05:36,745 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1633962.0, ans=0.125 2023-06-26 19:05:51,002 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1633962.0, ans=0.125 2023-06-26 19:06:27,488 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1634082.0, ans=0.125 2023-06-26 19:06:46,270 INFO [train.py:996] (0/4) Epoch 9, batch 28400, loss[loss=0.1934, simple_loss=0.2673, pruned_loss=0.05978, over 21640.00 frames. ], tot_loss[loss=0.2028, simple_loss=0.2791, pruned_loss=0.06325, over 4266618.19 frames. 
], batch size: 263, lr: 3.21e-03, grad_scale: 32.0 2023-06-26 19:07:41,787 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.757e+02 5.507e+02 7.639e+02 1.116e+03 2.582e+03, threshold=1.528e+03, percent-clipped=10.0 2023-06-26 19:07:47,459 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=1634322.0, ans=0.05 2023-06-26 19:08:33,699 INFO [train.py:996] (0/4) Epoch 9, batch 28450, loss[loss=0.2305, simple_loss=0.2994, pruned_loss=0.08074, over 21820.00 frames. ], tot_loss[loss=0.2077, simple_loss=0.2834, pruned_loss=0.06605, over 4267000.79 frames. ], batch size: 282, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:08:35,813 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1634442.0, ans=0.0 2023-06-26 19:09:28,768 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1634562.0, ans=0.125 2023-06-26 19:10:22,652 INFO [train.py:996] (0/4) Epoch 9, batch 28500, loss[loss=0.2236, simple_loss=0.2954, pruned_loss=0.0759, over 21811.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2857, pruned_loss=0.06857, over 4280910.54 frames. ], batch size: 247, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:11:03,829 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.65 vs. limit=15.0 2023-06-26 19:11:12,899 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.47 vs. limit=15.0 2023-06-26 19:11:14,030 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1634862.0, ans=0.125 2023-06-26 19:11:15,922 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1634862.0, ans=0.2 2023-06-26 19:11:20,224 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.065e+02 5.079e+02 6.899e+02 9.776e+02 2.125e+03, threshold=1.380e+03, percent-clipped=6.0 2023-06-26 19:12:18,051 INFO [train.py:996] (0/4) Epoch 9, batch 28550, loss[loss=0.256, simple_loss=0.3545, pruned_loss=0.07877, over 21882.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2947, pruned_loss=0.07192, over 4287334.48 frames. ], batch size: 372, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:13:00,379 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.20 vs. limit=22.5 2023-06-26 19:13:19,029 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 19:13:38,204 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1635222.0, ans=0.125 2023-06-26 19:14:06,454 INFO [train.py:996] (0/4) Epoch 9, batch 28600, loss[loss=0.2218, simple_loss=0.3007, pruned_loss=0.07144, over 21708.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.3016, pruned_loss=0.07423, over 4289547.98 frames. 
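The loss fields fit together as a weighted sum of the two transducer terms: for the averaged values in these lines, 0.5 * simple_loss + pruned_loss reproduces the logged loss (e.g. 0.5 * 0.3016 + 0.07423 ≈ 0.2251 just above). The combination is sketched schematically below, with the weights treated as illustrative rather than read from the training script.

    import torch

    def combine_losses(simple_loss: torch.Tensor,
                       pruned_loss: torch.Tensor,
                       simple_scale: float = 0.5,
                       pruned_scale: float = 1.0) -> torch.Tensor:
        # Weighted sum of the two loss terms; the scales are illustrative
        # defaults chosen to match the logged numbers, not recipe constants.
        return simple_scale * simple_loss + pruned_scale * pruned_loss

    loss = combine_losses(torch.tensor(0.3016), torch.tensor(0.07423))
    print(loss.item())  # ~0.2250, matching the logged tot_loss value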
], batch size: 298, lr: 3.20e-03, grad_scale: 8.0 2023-06-26 19:14:09,961 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=1635342.0, ans=10.0 2023-06-26 19:15:10,706 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.780e+02 5.299e+02 6.853e+02 1.013e+03 2.004e+03, threshold=1.371e+03, percent-clipped=8.0 2023-06-26 19:15:15,615 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.90 vs. limit=22.5 2023-06-26 19:15:27,182 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1635522.0, ans=0.125 2023-06-26 19:15:41,458 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1635582.0, ans=0.0 2023-06-26 19:16:02,351 INFO [train.py:996] (0/4) Epoch 9, batch 28650, loss[loss=0.227, simple_loss=0.2945, pruned_loss=0.0798, over 20050.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.2968, pruned_loss=0.07366, over 4282941.32 frames. ], batch size: 702, lr: 3.20e-03, grad_scale: 8.0 2023-06-26 19:16:38,565 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1635702.0, ans=0.125 2023-06-26 19:16:55,830 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.84 vs. limit=22.5 2023-06-26 19:17:38,506 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1635882.0, ans=0.05 2023-06-26 19:17:50,895 INFO [train.py:996] (0/4) Epoch 9, batch 28700, loss[loss=0.1782, simple_loss=0.223, pruned_loss=0.06671, over 20022.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.2945, pruned_loss=0.07393, over 4283512.31 frames. ], batch size: 702, lr: 3.20e-03, grad_scale: 8.0 2023-06-26 19:18:01,696 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1635942.0, ans=0.1 2023-06-26 19:18:06,593 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1636002.0, ans=0.0 2023-06-26 19:18:48,051 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.582e+02 5.345e+02 7.889e+02 1.390e+03 2.918e+03, threshold=1.578e+03, percent-clipped=26.0 2023-06-26 19:19:07,686 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1636122.0, ans=0.0 2023-06-26 19:19:09,206 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1636122.0, ans=0.125 2023-06-26 19:19:40,200 INFO [train.py:996] (0/4) Epoch 9, batch 28750, loss[loss=0.2173, simple_loss=0.2887, pruned_loss=0.07299, over 21934.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2954, pruned_loss=0.07433, over 4279853.40 frames. 
], batch size: 316, lr: 3.20e-03, grad_scale: 8.0 2023-06-26 19:20:01,145 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1636242.0, ans=0.125 2023-06-26 19:20:28,063 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_na.min_abs, batch_count=1636362.0, ans=0.02 2023-06-26 19:20:56,563 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1636422.0, ans=0.2 2023-06-26 19:21:14,438 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1636482.0, ans=0.125 2023-06-26 19:21:31,209 INFO [train.py:996] (0/4) Epoch 9, batch 28800, loss[loss=0.2811, simple_loss=0.3523, pruned_loss=0.105, over 21384.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.2986, pruned_loss=0.07455, over 4277541.94 frames. ], batch size: 159, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:21:51,355 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1636542.0, ans=0.125 2023-06-26 19:21:54,909 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1636602.0, ans=0.015 2023-06-26 19:22:02,169 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.55 vs. limit=22.5 2023-06-26 19:22:32,159 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=1636662.0, ans=0.5 2023-06-26 19:22:33,370 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.768e+02 5.031e+02 6.250e+02 8.713e+02 2.260e+03, threshold=1.250e+03, percent-clipped=3.0 2023-06-26 19:22:35,552 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1636662.0, ans=0.1 2023-06-26 19:22:37,495 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1636722.0, ans=0.1 2023-06-26 19:22:39,384 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1636722.0, ans=0.0 2023-06-26 19:22:50,774 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.43 vs. limit=15.0 2023-06-26 19:23:25,613 INFO [train.py:996] (0/4) Epoch 9, batch 28850, loss[loss=0.205, simple_loss=0.2759, pruned_loss=0.06704, over 21843.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.2993, pruned_loss=0.07574, over 4284501.85 frames. ], batch size: 247, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:24:24,170 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1636962.0, ans=0.125 2023-06-26 19:24:42,405 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1637022.0, ans=0.125 2023-06-26 19:24:45,874 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1637022.0, ans=0.125 2023-06-26 19:25:14,966 INFO [train.py:996] (0/4) Epoch 9, batch 28900, loss[loss=0.3074, simple_loss=0.3677, pruned_loss=0.1236, over 21448.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3038, pruned_loss=0.07731, over 4281620.10 frames. 
], batch size: 507, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:25:15,531 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1637142.0, ans=0.125 2023-06-26 19:25:40,258 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.49 vs. limit=15.0 2023-06-26 19:25:57,855 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1637202.0, ans=0.1 2023-06-26 19:26:18,364 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.963e+02 5.862e+02 9.485e+02 1.263e+03 2.647e+03, threshold=1.897e+03, percent-clipped=25.0 2023-06-26 19:26:33,689 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1637322.0, ans=0.0 2023-06-26 19:26:57,754 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1637382.0, ans=0.1 2023-06-26 19:27:10,763 INFO [train.py:996] (0/4) Epoch 9, batch 28950, loss[loss=0.2115, simple_loss=0.31, pruned_loss=0.0565, over 21767.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3042, pruned_loss=0.07614, over 4277877.29 frames. ], batch size: 332, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:28:17,462 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.71 vs. limit=15.0 2023-06-26 19:28:45,487 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1637682.0, ans=0.125 2023-06-26 19:29:07,343 INFO [train.py:996] (0/4) Epoch 9, batch 29000, loss[loss=0.2384, simple_loss=0.3152, pruned_loss=0.08084, over 21253.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3056, pruned_loss=0.07503, over 4275669.14 frames. ], batch size: 143, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:29:15,597 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.31 vs. limit=12.0 2023-06-26 19:30:02,153 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.689e+02 5.853e+02 8.491e+02 1.284e+03 2.472e+03, threshold=1.698e+03, percent-clipped=8.0 2023-06-26 19:30:04,121 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1637862.0, ans=0.125 2023-06-26 19:30:57,306 INFO [train.py:996] (0/4) Epoch 9, batch 29050, loss[loss=0.2364, simple_loss=0.3004, pruned_loss=0.08619, over 21885.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3047, pruned_loss=0.07585, over 4284276.14 frames. ], batch size: 414, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:31:06,433 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 19:31:21,406 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1638102.0, ans=0.1 2023-06-26 19:31:27,971 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1638102.0, ans=0.0 2023-06-26 19:32:46,723 INFO [train.py:996] (0/4) Epoch 9, batch 29100, loss[loss=0.1884, simple_loss=0.2461, pruned_loss=0.06539, over 21189.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.296, pruned_loss=0.07351, over 4284306.24 frames. 
], batch size: 159, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:33:44,927 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.478e+02 5.318e+02 7.274e+02 9.701e+02 2.233e+03, threshold=1.455e+03, percent-clipped=4.0 2023-06-26 19:34:27,989 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.15 vs. limit=22.5 2023-06-26 19:34:35,014 INFO [train.py:996] (0/4) Epoch 9, batch 29150, loss[loss=0.2315, simple_loss=0.3215, pruned_loss=0.07075, over 21698.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2934, pruned_loss=0.07149, over 4287419.60 frames. ], batch size: 332, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:34:51,167 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1638642.0, ans=0.125 2023-06-26 19:34:53,593 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.56 vs. limit=6.0 2023-06-26 19:34:54,492 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1638642.0, ans=0.125 2023-06-26 19:35:16,924 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1638762.0, ans=0.1 2023-06-26 19:35:29,161 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1638762.0, ans=0.0 2023-06-26 19:35:30,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1638762.0, ans=0.5 2023-06-26 19:35:30,928 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1638762.0, ans=0.2 2023-06-26 19:35:48,619 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1638822.0, ans=0.0 2023-06-26 19:35:56,106 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1638822.0, ans=0.0 2023-06-26 19:36:23,260 INFO [train.py:996] (0/4) Epoch 9, batch 29200, loss[loss=0.2005, simple_loss=0.2709, pruned_loss=0.06506, over 21571.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2893, pruned_loss=0.0706, over 4283258.02 frames. ], batch size: 414, lr: 3.20e-03, grad_scale: 32.0 2023-06-26 19:36:40,407 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1638942.0, ans=0.0 2023-06-26 19:36:51,012 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1639002.0, ans=0.125 2023-06-26 19:37:28,582 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.809e+02 5.329e+02 8.203e+02 1.175e+03 2.946e+03, threshold=1.641e+03, percent-clipped=12.0 2023-06-26 19:38:07,699 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.01 vs. limit=22.5 2023-06-26 19:38:10,761 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1639242.0, ans=0.1 2023-06-26 19:38:11,691 INFO [train.py:996] (0/4) Epoch 9, batch 29250, loss[loss=0.1853, simple_loss=0.2634, pruned_loss=0.0536, over 21563.00 frames. 
], tot_loss[loss=0.2136, simple_loss=0.2891, pruned_loss=0.06908, over 4284910.58 frames. ], batch size: 263, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:38:14,664 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.07 vs. limit=15.0 2023-06-26 19:38:23,752 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.12 vs. limit=6.0 2023-06-26 19:38:24,752 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1639242.0, ans=0.125 2023-06-26 19:40:04,143 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1639542.0, ans=0.0 2023-06-26 19:40:05,102 INFO [train.py:996] (0/4) Epoch 9, batch 29300, loss[loss=0.2, simple_loss=0.2754, pruned_loss=0.06229, over 21757.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2901, pruned_loss=0.06805, over 4269482.69 frames. ], batch size: 351, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:40:51,965 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.09 vs. limit=15.0 2023-06-26 19:41:03,790 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.751e+02 5.530e+02 7.690e+02 1.193e+03 2.293e+03, threshold=1.538e+03, percent-clipped=8.0 2023-06-26 19:41:08,651 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1639722.0, ans=0.0 2023-06-26 19:41:21,150 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1639722.0, ans=0.0 2023-06-26 19:41:31,065 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1639782.0, ans=0.1 2023-06-26 19:41:32,972 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1639782.0, ans=0.125 2023-06-26 19:41:40,618 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1639782.0, ans=0.125 2023-06-26 19:41:55,419 INFO [train.py:996] (0/4) Epoch 9, batch 29350, loss[loss=0.2093, simple_loss=0.3078, pruned_loss=0.05541, over 20912.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2867, pruned_loss=0.06752, over 4268115.13 frames. ], batch size: 609, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:42:26,075 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1639902.0, ans=0.125 2023-06-26 19:43:22,541 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1640082.0, ans=0.125 2023-06-26 19:43:26,610 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1640082.0, ans=0.125 2023-06-26 19:43:43,342 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1640082.0, ans=0.0 2023-06-26 19:43:47,629 INFO [train.py:996] (0/4) Epoch 9, batch 29400, loss[loss=0.1732, simple_loss=0.2417, pruned_loss=0.05239, over 21601.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2861, pruned_loss=0.06578, over 4259426.16 frames. 
], batch size: 195, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:44:16,398 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.33 vs. limit=15.0 2023-06-26 19:44:36,453 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1640262.0, ans=0.125 2023-06-26 19:44:53,340 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.711e+02 5.689e+02 1.066e+03 1.595e+03 4.259e+03, threshold=2.132e+03, percent-clipped=27.0 2023-06-26 19:45:21,396 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1640382.0, ans=0.0 2023-06-26 19:45:26,990 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1640382.0, ans=0.0 2023-06-26 19:45:44,129 INFO [train.py:996] (0/4) Epoch 9, batch 29450, loss[loss=0.1834, simple_loss=0.283, pruned_loss=0.04185, over 20897.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2861, pruned_loss=0.06508, over 4260521.11 frames. ], batch size: 609, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:46:22,689 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1640562.0, ans=0.07 2023-06-26 19:46:28,263 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.44 vs. limit=10.0 2023-06-26 19:46:44,945 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1640622.0, ans=0.0 2023-06-26 19:47:08,409 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1640682.0, ans=0.0 2023-06-26 19:47:26,965 INFO [train.py:996] (0/4) Epoch 9, batch 29500, loss[loss=0.1985, simple_loss=0.2753, pruned_loss=0.0609, over 21809.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2899, pruned_loss=0.06788, over 4264423.51 frames. ], batch size: 112, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:47:42,140 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1640742.0, ans=0.2 2023-06-26 19:47:51,963 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1640802.0, ans=0.125 2023-06-26 19:48:12,395 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1640862.0, ans=0.125 2023-06-26 19:48:21,596 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.12 vs. limit=15.0 2023-06-26 19:48:30,388 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.856e+02 6.070e+02 8.083e+02 1.104e+03 1.958e+03, threshold=1.617e+03, percent-clipped=0.0 2023-06-26 19:48:31,857 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.75 vs. 
limit=22.5 2023-06-26 19:48:38,252 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1640922.0, ans=0.125 2023-06-26 19:48:48,567 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1640922.0, ans=0.125 2023-06-26 19:49:14,767 INFO [train.py:996] (0/4) Epoch 9, batch 29550, loss[loss=0.1919, simple_loss=0.2429, pruned_loss=0.07039, over 20312.00 frames. ], tot_loss[loss=0.214, simple_loss=0.289, pruned_loss=0.0695, over 4275410.71 frames. ], batch size: 703, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:50:41,857 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1641222.0, ans=0.125 2023-06-26 19:51:11,589 INFO [train.py:996] (0/4) Epoch 9, batch 29600, loss[loss=0.2259, simple_loss=0.3144, pruned_loss=0.06865, over 21715.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2958, pruned_loss=0.07214, over 4278972.06 frames. ], batch size: 247, lr: 3.20e-03, grad_scale: 32.0 2023-06-26 19:51:17,641 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1641342.0, ans=0.2 2023-06-26 19:51:41,551 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1641402.0, ans=0.0 2023-06-26 19:52:16,244 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.429e+02 6.278e+02 9.739e+02 1.305e+03 2.412e+03, threshold=1.948e+03, percent-clipped=12.0 2023-06-26 19:53:00,026 INFO [train.py:996] (0/4) Epoch 9, batch 29650, loss[loss=0.2397, simple_loss=0.3039, pruned_loss=0.08773, over 21658.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2958, pruned_loss=0.07, over 4270145.41 frames. ], batch size: 473, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:53:45,805 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1641762.0, ans=0.0 2023-06-26 19:54:49,514 INFO [train.py:996] (0/4) Epoch 9, batch 29700, loss[loss=0.2246, simple_loss=0.3113, pruned_loss=0.06895, over 21415.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.295, pruned_loss=0.06977, over 4274087.78 frames. ], batch size: 548, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:54:51,518 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1641942.0, ans=0.125 2023-06-26 19:55:36,554 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 19:55:37,043 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.57 vs. limit=12.0 2023-06-26 19:55:55,276 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.514e+02 4.988e+02 7.625e+02 1.121e+03 2.201e+03, threshold=1.525e+03, percent-clipped=1.0 2023-06-26 19:56:08,221 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1642122.0, ans=0.0 2023-06-26 19:56:33,843 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1642182.0, ans=0.125 2023-06-26 19:56:38,118 INFO [train.py:996] (0/4) Epoch 9, batch 29750, loss[loss=0.2242, simple_loss=0.3132, pruned_loss=0.06762, over 21666.00 frames. 
], tot_loss[loss=0.2193, simple_loss=0.2996, pruned_loss=0.06947, over 4274590.11 frames. ], batch size: 263, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:56:49,367 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1642242.0, ans=0.2 2023-06-26 19:57:14,709 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1642302.0, ans=0.0 2023-06-26 19:57:14,753 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1642302.0, ans=0.2 2023-06-26 19:57:39,773 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1642362.0, ans=0.125 2023-06-26 19:57:53,571 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1642422.0, ans=0.0 2023-06-26 19:58:26,772 INFO [train.py:996] (0/4) Epoch 9, batch 29800, loss[loss=0.2189, simple_loss=0.3224, pruned_loss=0.05773, over 19840.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.3011, pruned_loss=0.06957, over 4274167.18 frames. ], batch size: 703, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:59:11,529 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.99 vs. limit=15.0 2023-06-26 19:59:23,859 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1642662.0, ans=0.125 2023-06-26 19:59:33,430 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.208e+02 7.577e+02 1.107e+03 1.626e+03 2.906e+03, threshold=2.213e+03, percent-clipped=30.0 2023-06-26 19:59:42,981 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1642722.0, ans=0.1 2023-06-26 20:00:03,404 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1642782.0, ans=0.2 2023-06-26 20:00:15,163 INFO [train.py:996] (0/4) Epoch 9, batch 29850, loss[loss=0.1892, simple_loss=0.2724, pruned_loss=0.05302, over 21804.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2972, pruned_loss=0.06723, over 4273013.41 frames. ], batch size: 298, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 20:01:46,229 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1643082.0, ans=0.0 2023-06-26 20:02:05,819 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=6.68 vs. limit=22.5 2023-06-26 20:02:08,082 INFO [train.py:996] (0/4) Epoch 9, batch 29900, loss[loss=0.2724, simple_loss=0.3252, pruned_loss=0.1098, over 21628.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2947, pruned_loss=0.06788, over 4279383.69 frames. 
], batch size: 471, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 20:02:15,097 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1643142.0, ans=0.2 2023-06-26 20:02:18,793 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1643142.0, ans=0.125 2023-06-26 20:02:40,005 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1643202.0, ans=0.1 2023-06-26 20:02:45,371 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1643202.0, ans=0.2 2023-06-26 20:03:09,778 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.950e+02 5.576e+02 8.031e+02 1.172e+03 2.675e+03, threshold=1.606e+03, percent-clipped=3.0 2023-06-26 20:03:57,897 INFO [train.py:996] (0/4) Epoch 9, batch 29950, loss[loss=0.2218, simple_loss=0.2948, pruned_loss=0.07445, over 21771.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2971, pruned_loss=0.07113, over 4283192.10 frames. ], batch size: 332, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 20:04:02,810 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.06 vs. limit=22.5 2023-06-26 20:04:31,027 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.63 vs. limit=10.0 2023-06-26 20:04:55,596 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1643562.0, ans=0.125 2023-06-26 20:05:09,151 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.67 vs. limit=15.0 2023-06-26 20:05:11,572 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1643622.0, ans=0.125 2023-06-26 20:05:38,267 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1643682.0, ans=0.0 2023-06-26 20:05:49,907 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1643682.0, ans=0.125 2023-06-26 20:05:54,897 INFO [train.py:996] (0/4) Epoch 9, batch 30000, loss[loss=0.1501, simple_loss=0.2075, pruned_loss=0.04637, over 16825.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2978, pruned_loss=0.07124, over 4277936.99 frames. ], batch size: 61, lr: 3.20e-03, grad_scale: 32.0 2023-06-26 20:05:54,899 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-26 20:06:07,939 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.3.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([2.3529, 3.1990, 3.3667, 3.5225, 3.0595, 2.9241, 3.5772, 3.5374], device='cuda:0') 2023-06-26 20:06:13,532 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.2.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([1.9743, 2.4979, 4.0039, 2.9475], device='cuda:0') 2023-06-26 20:06:15,949 INFO [train.py:1028] (0/4) Epoch 9, validation: loss=0.2518, simple_loss=0.3443, pruned_loss=0.07961, over 1796401.00 frames. 
2023-06-26 20:06:15,950 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23714MB 2023-06-26 20:06:24,675 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1643742.0, ans=0.125 2023-06-26 20:06:25,175 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.18 vs. limit=15.0 2023-06-26 20:07:08,060 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1643862.0, ans=0.07 2023-06-26 20:07:22,159 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.541e+02 6.693e+02 9.863e+02 1.324e+03 2.517e+03, threshold=1.973e+03, percent-clipped=14.0 2023-06-26 20:07:23,017 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1643922.0, ans=0.1 2023-06-26 20:07:36,018 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.26 vs. limit=15.0 2023-06-26 20:07:43,107 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1643922.0, ans=0.09899494936611666 2023-06-26 20:07:43,118 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1643922.0, ans=0.1 2023-06-26 20:07:44,835 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1643982.0, ans=0.0 2023-06-26 20:08:09,886 INFO [train.py:996] (0/4) Epoch 9, batch 30050, loss[loss=0.165, simple_loss=0.2451, pruned_loss=0.04249, over 21840.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.3003, pruned_loss=0.06814, over 4273602.15 frames. ], batch size: 118, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 20:08:12,226 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1644042.0, ans=0.0 2023-06-26 20:08:30,005 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1644042.0, ans=0.0 2023-06-26 20:08:37,025 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1644102.0, ans=0.125 2023-06-26 20:09:48,934 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1644282.0, ans=0.2 2023-06-26 20:10:03,821 INFO [train.py:996] (0/4) Epoch 9, batch 30100, loss[loss=0.1991, simple_loss=0.2641, pruned_loss=0.06701, over 21717.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.3007, pruned_loss=0.06837, over 4278879.18 frames. ], batch size: 299, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 20:11:07,511 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.020e+02 5.633e+02 9.341e+02 1.482e+03 2.871e+03, threshold=1.868e+03, percent-clipped=12.0 2023-06-26 20:11:22,390 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1644522.0, ans=0.2 2023-06-26 20:11:24,626 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.94 vs. 
limit=22.5 2023-06-26 20:11:53,758 INFO [train.py:996] (0/4) Epoch 9, batch 30150, loss[loss=0.2443, simple_loss=0.3121, pruned_loss=0.08821, over 21349.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2979, pruned_loss=0.06971, over 4279607.90 frames. ], batch size: 176, lr: 3.19e-03, grad_scale: 16.0 2023-06-26 20:12:01,647 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1644642.0, ans=0.0 2023-06-26 20:12:32,434 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=1644702.0, ans=0.025 2023-06-26 20:12:40,026 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.38 vs. limit=15.0 2023-06-26 20:13:24,909 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1644822.0, ans=0.0 2023-06-26 20:13:49,768 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1644942.0, ans=0.125 2023-06-26 20:13:50,908 INFO [train.py:996] (0/4) Epoch 9, batch 30200, loss[loss=0.23, simple_loss=0.3253, pruned_loss=0.06733, over 21322.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2996, pruned_loss=0.06926, over 4271378.67 frames. ], batch size: 549, lr: 3.19e-03, grad_scale: 16.0 2023-06-26 20:13:57,442 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1644942.0, ans=0.125 2023-06-26 20:14:04,446 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1644942.0, ans=0.1 2023-06-26 20:14:35,529 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.39 vs. limit=6.0 2023-06-26 20:14:37,272 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.22 vs. limit=10.0 2023-06-26 20:15:01,602 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.601e+02 6.016e+02 8.945e+02 1.496e+03 2.296e+03, threshold=1.789e+03, percent-clipped=11.0 2023-06-26 20:15:22,992 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1645182.0, ans=0.125 2023-06-26 20:15:23,048 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1645182.0, ans=0.1 2023-06-26 20:15:23,741 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.40 vs. limit=15.0 2023-06-26 20:15:42,559 INFO [train.py:996] (0/4) Epoch 9, batch 30250, loss[loss=0.2222, simple_loss=0.3125, pruned_loss=0.06594, over 21787.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3088, pruned_loss=0.07186, over 4273921.64 frames. ], batch size: 124, lr: 3.19e-03, grad_scale: 16.0 2023-06-26 20:15:53,691 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.86 vs. limit=15.0 2023-06-26 20:16:45,766 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.62 vs. 
limit=15.0 2023-06-26 20:16:54,015 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1645422.0, ans=0.125 2023-06-26 20:17:37,248 INFO [train.py:996] (0/4) Epoch 9, batch 30300, loss[loss=0.1817, simple_loss=0.2469, pruned_loss=0.05827, over 21208.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.3069, pruned_loss=0.07171, over 4274710.96 frames. ], batch size: 549, lr: 3.19e-03, grad_scale: 16.0 2023-06-26 20:17:50,646 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1645542.0, ans=0.125 2023-06-26 20:17:51,271 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.47 vs. limit=10.0 2023-06-26 20:18:47,517 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.854e+02 6.241e+02 9.150e+02 1.357e+03 2.520e+03, threshold=1.830e+03, percent-clipped=12.0 2023-06-26 20:19:35,130 INFO [train.py:996] (0/4) Epoch 9, batch 30350, loss[loss=0.258, simple_loss=0.3663, pruned_loss=0.07486, over 20712.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3064, pruned_loss=0.07309, over 4270536.84 frames. ], batch size: 607, lr: 3.19e-03, grad_scale: 16.0 2023-06-26 20:20:12,428 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1645962.0, ans=0.2 2023-06-26 20:20:13,977 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1645962.0, ans=0.125 2023-06-26 20:20:58,742 INFO [train.py:996] (0/4) Epoch 9, batch 30400, loss[loss=0.2013, simple_loss=0.2536, pruned_loss=0.07453, over 20189.00 frames. ], tot_loss[loss=0.223, simple_loss=0.302, pruned_loss=0.072, over 4251728.59 frames. ], batch size: 703, lr: 3.19e-03, grad_scale: 32.0 2023-06-26 20:21:05,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1646142.0, ans=0.5 2023-06-26 20:21:37,859 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1646262.0, ans=0.07 2023-06-26 20:21:50,076 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 20:21:55,166 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.123e+02 6.385e+02 9.749e+02 1.472e+03 9.200e+03, threshold=1.950e+03, percent-clipped=15.0 2023-06-26 20:22:22,733 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1646382.0, ans=0.2 2023-06-26 20:22:29,104 INFO [train.py:996] (0/4) Epoch 9, batch 30450, loss[loss=0.25, simple_loss=0.3458, pruned_loss=0.07704, over 19854.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.3015, pruned_loss=0.0714, over 4194192.06 frames. ], batch size: 702, lr: 3.19e-03, grad_scale: 16.0 2023-06-26 20:23:01,683 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1646562.0, ans=0.0 2023-06-26 20:23:41,201 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/epoch-9.pt 2023-06-26 20:25:55,225 INFO [train.py:996] (0/4) Epoch 10, batch 0, loss[loss=0.1898, simple_loss=0.2585, pruned_loss=0.06058, over 21613.00 frames. ], tot_loss[loss=0.1898, simple_loss=0.2585, pruned_loss=0.06058, over 21613.00 frames. 
], batch size: 264, lr: 3.02e-03, grad_scale: 32.0 2023-06-26 20:25:55,227 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-26 20:26:11,822 INFO [train.py:1028] (0/4) Epoch 10, validation: loss=0.2437, simple_loss=0.3472, pruned_loss=0.0701, over 1796401.00 frames. 2023-06-26 20:26:11,823 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23714MB 2023-06-26 20:26:42,888 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff2.min_abs, batch_count=1646772.0, ans=0.1 2023-06-26 20:27:03,691 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1646832.0, ans=0.0 2023-06-26 20:27:27,463 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1646892.0, ans=0.0 2023-06-26 20:27:35,020 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.973e+02 1.183e+03 2.082e+03 3.728e+03 9.226e+03, threshold=4.165e+03, percent-clipped=55.0 2023-06-26 20:27:39,472 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1646952.0, ans=0.0 2023-06-26 20:27:57,603 INFO [train.py:996] (0/4) Epoch 10, batch 50, loss[loss=0.3069, simple_loss=0.3746, pruned_loss=0.1196, over 21349.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.3019, pruned_loss=0.07261, over 953168.16 frames. ], batch size: 507, lr: 3.02e-03, grad_scale: 16.0 2023-06-26 20:28:08,889 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1647012.0, ans=0.125 2023-06-26 20:28:10,462 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1647012.0, ans=0.0 2023-06-26 20:29:25,082 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.61 vs. limit=15.0 2023-06-26 20:29:25,879 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1647252.0, ans=0.2 2023-06-26 20:29:32,682 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1647252.0, ans=0.0 2023-06-26 20:29:39,773 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1647252.0, ans=0.1 2023-06-26 20:29:40,494 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=26.99 vs. limit=22.5 2023-06-26 20:29:44,271 INFO [train.py:996] (0/4) Epoch 10, batch 100, loss[loss=0.2535, simple_loss=0.3611, pruned_loss=0.07292, over 21724.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3213, pruned_loss=0.07346, over 1691345.75 frames. ], batch size: 389, lr: 3.02e-03, grad_scale: 16.0 2023-06-26 20:30:08,366 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1647372.0, ans=0.0 2023-06-26 20:30:10,804 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.82 vs. 
limit=22.5 2023-06-26 20:30:21,808 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1647372.0, ans=0.2 2023-06-26 20:31:06,598 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.835e+02 5.191e+02 6.971e+02 9.608e+02 1.975e+03, threshold=1.394e+03, percent-clipped=0.0 2023-06-26 20:31:13,876 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1647552.0, ans=0.125 2023-06-26 20:31:25,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1647552.0, ans=0.04949747468305833 2023-06-26 20:31:28,462 INFO [train.py:996] (0/4) Epoch 10, batch 150, loss[loss=0.237, simple_loss=0.3358, pruned_loss=0.06913, over 21803.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.322, pruned_loss=0.07296, over 2261144.42 frames. ], batch size: 371, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 20:31:44,581 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 20:31:56,216 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1647672.0, ans=0.1 2023-06-26 20:32:50,090 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1647792.0, ans=10.0 2023-06-26 20:32:55,314 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1647792.0, ans=0.0 2023-06-26 20:33:14,191 INFO [train.py:996] (0/4) Epoch 10, batch 200, loss[loss=0.207, simple_loss=0.2857, pruned_loss=0.06416, over 21902.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3179, pruned_loss=0.07147, over 2701496.99 frames. ], batch size: 98, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 20:33:21,799 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1647912.0, ans=0.0 2023-06-26 20:33:35,555 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.32 vs. limit=6.0 2023-06-26 20:33:43,830 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.13 vs. limit=10.0 2023-06-26 20:33:56,837 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1647972.0, ans=0.125 2023-06-26 20:34:39,713 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.906e+02 5.339e+02 8.333e+02 1.175e+03 2.265e+03, threshold=1.667e+03, percent-clipped=16.0 2023-06-26 20:34:40,383 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1648092.0, ans=0.1 2023-06-26 20:34:41,072 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.80 vs. limit=10.0 2023-06-26 20:35:01,906 INFO [train.py:996] (0/4) Epoch 10, batch 250, loss[loss=0.2139, simple_loss=0.2876, pruned_loss=0.07012, over 21850.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3107, pruned_loss=0.07146, over 3057664.89 frames. 
], batch size: 332, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 20:35:37,204 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.72 vs. limit=15.0 2023-06-26 20:35:53,605 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1648332.0, ans=0.1 2023-06-26 20:35:55,185 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1648332.0, ans=0.0 2023-06-26 20:36:26,343 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1648392.0, ans=0.025 2023-06-26 20:36:29,814 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1648392.0, ans=0.2 2023-06-26 20:36:38,842 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1648452.0, ans=0.125 2023-06-26 20:36:44,527 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.60 vs. limit=6.0 2023-06-26 20:36:54,066 INFO [train.py:996] (0/4) Epoch 10, batch 300, loss[loss=0.2251, simple_loss=0.2859, pruned_loss=0.08212, over 21274.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.3061, pruned_loss=0.07161, over 3316763.97 frames. ], batch size: 143, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 20:36:57,300 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.72 vs. limit=10.0 2023-06-26 20:36:58,792 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.33 vs. limit=15.0 2023-06-26 20:37:08,934 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1648512.0, ans=0.125 2023-06-26 20:37:12,196 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1648512.0, ans=0.1 2023-06-26 20:37:33,980 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.88 vs. limit=15.0 2023-06-26 20:37:36,758 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1648572.0, ans=0.05 2023-06-26 20:38:04,106 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1648692.0, ans=0.0 2023-06-26 20:38:11,236 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1648692.0, ans=0.125 2023-06-26 20:38:17,608 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.698e+02 5.791e+02 8.130e+02 1.304e+03 2.175e+03, threshold=1.626e+03, percent-clipped=9.0 2023-06-26 20:38:36,091 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1648752.0, ans=0.125 2023-06-26 20:38:40,493 INFO [train.py:996] (0/4) Epoch 10, batch 350, loss[loss=0.2339, simple_loss=0.301, pruned_loss=0.08341, over 21931.00 frames. 
], tot_loss[loss=0.2204, simple_loss=0.2986, pruned_loss=0.07109, over 3530871.87 frames. ], batch size: 351, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 20:38:42,708 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1648812.0, ans=0.125 2023-06-26 20:38:45,379 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.63 vs. limit=15.0 2023-06-26 20:39:08,384 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1648872.0, ans=0.125 2023-06-26 20:39:20,656 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1648872.0, ans=0.0 2023-06-26 20:39:32,626 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1648932.0, ans=0.125 2023-06-26 20:40:14,297 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.09 vs. limit=10.0 2023-06-26 20:40:24,663 INFO [train.py:996] (0/4) Epoch 10, batch 400, loss[loss=0.21, simple_loss=0.3305, pruned_loss=0.0447, over 19917.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2923, pruned_loss=0.06911, over 3683298.04 frames. ], batch size: 703, lr: 3.01e-03, grad_scale: 32.0 2023-06-26 20:40:37,126 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.02 vs. limit=15.0 2023-06-26 20:40:39,853 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=1649112.0, ans=0.95 2023-06-26 20:41:14,997 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1649232.0, ans=0.125 2023-06-26 20:41:39,750 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1649292.0, ans=0.1 2023-06-26 20:41:39,807 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1649292.0, ans=0.2 2023-06-26 20:41:53,092 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.780e+02 7.996e+02 1.335e+03 1.838e+03 3.332e+03, threshold=2.670e+03, percent-clipped=35.0 2023-06-26 20:42:14,170 INFO [train.py:996] (0/4) Epoch 10, batch 450, loss[loss=0.2645, simple_loss=0.3764, pruned_loss=0.07626, over 21746.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2922, pruned_loss=0.06798, over 3810859.59 frames. ], batch size: 414, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 20:43:16,147 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1649532.0, ans=0.2 2023-06-26 20:43:19,610 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1649532.0, ans=0.125 2023-06-26 20:43:42,833 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1649652.0, ans=0.2 2023-06-26 20:43:59,406 INFO [train.py:996] (0/4) Epoch 10, batch 500, loss[loss=0.1973, simple_loss=0.2652, pruned_loss=0.06467, over 19930.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2922, pruned_loss=0.06656, over 3918695.42 frames. 
], batch size: 704, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 20:44:10,648 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1649712.0, ans=0.0 2023-06-26 20:44:31,158 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1649772.0, ans=0.09899494936611666 2023-06-26 20:44:31,720 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.45 vs. limit=15.0 2023-06-26 20:45:24,857 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.219e+02 9.001e+02 1.327e+03 2.089e+03 4.282e+03, threshold=2.653e+03, percent-clipped=10.0 2023-06-26 20:45:51,444 INFO [train.py:996] (0/4) Epoch 10, batch 550, loss[loss=0.1726, simple_loss=0.2575, pruned_loss=0.04383, over 21685.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2939, pruned_loss=0.06623, over 3996460.04 frames. ], batch size: 247, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 20:45:53,688 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1650012.0, ans=0.035 2023-06-26 20:46:01,269 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.79 vs. limit=22.5 2023-06-26 20:46:19,973 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1650072.0, ans=0.0 2023-06-26 20:46:41,130 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1650132.0, ans=0.125 2023-06-26 20:47:28,817 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1650252.0, ans=0.2 2023-06-26 20:47:33,158 INFO [train.py:996] (0/4) Epoch 10, batch 600, loss[loss=0.2423, simple_loss=0.3262, pruned_loss=0.07921, over 21793.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2965, pruned_loss=0.06705, over 4059694.86 frames. ], batch size: 371, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 20:48:58,833 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.973e+02 6.857e+02 1.039e+03 1.439e+03 2.641e+03, threshold=2.079e+03, percent-clipped=0.0 2023-06-26 20:49:06,388 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1650552.0, ans=0.125 2023-06-26 20:49:19,455 INFO [train.py:996] (0/4) Epoch 10, batch 650, loss[loss=0.2038, simple_loss=0.2819, pruned_loss=0.06287, over 21921.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2975, pruned_loss=0.06746, over 4114283.87 frames. 
], batch size: 113, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 20:49:20,116 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1650612.0, ans=0.2 2023-06-26 20:49:45,611 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1650672.0, ans=0.125 2023-06-26 20:50:15,086 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1650732.0, ans=0.0 2023-06-26 20:50:47,702 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1650852.0, ans=0.04949747468305833 2023-06-26 20:50:49,205 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1650852.0, ans=0.125 2023-06-26 20:51:00,889 INFO [train.py:996] (0/4) Epoch 10, batch 700, loss[loss=0.2038, simple_loss=0.2731, pruned_loss=0.06723, over 21636.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.3002, pruned_loss=0.06879, over 4159030.71 frames. ], batch size: 230, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 20:51:01,643 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1650912.0, ans=0.2 2023-06-26 20:51:03,107 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1650912.0, ans=0.125 2023-06-26 20:51:51,254 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1651032.0, ans=0.125 2023-06-26 20:52:13,635 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1651092.0, ans=0.0 2023-06-26 20:52:17,024 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1651092.0, ans=0.0 2023-06-26 20:52:26,595 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.100e+02 6.227e+02 9.890e+02 1.482e+03 2.866e+03, threshold=1.978e+03, percent-clipped=9.0 2023-06-26 20:52:33,967 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1651152.0, ans=0.0 2023-06-26 20:52:38,939 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1651152.0, ans=0.0 2023-06-26 20:52:40,578 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1651152.0, ans=0.0 2023-06-26 20:52:47,470 INFO [train.py:996] (0/4) Epoch 10, batch 750, loss[loss=0.208, simple_loss=0.2697, pruned_loss=0.0732, over 15345.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2965, pruned_loss=0.06846, over 4188324.62 frames. 
], batch size: 63, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 20:53:02,127 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1651212.0, ans=0.0 2023-06-26 20:53:10,932 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=1651272.0, ans=15.0 2023-06-26 20:53:39,680 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 20:53:55,585 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1651392.0, ans=0.0 2023-06-26 20:54:04,474 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1651392.0, ans=0.125 2023-06-26 20:54:35,027 INFO [train.py:996] (0/4) Epoch 10, batch 800, loss[loss=0.2116, simple_loss=0.2843, pruned_loss=0.06946, over 21345.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2951, pruned_loss=0.06838, over 4203058.39 frames. ], batch size: 159, lr: 3.01e-03, grad_scale: 32.0 2023-06-26 20:54:36,117 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1651512.0, ans=0.125 2023-06-26 20:55:06,364 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=1651572.0, ans=15.0 2023-06-26 20:55:19,894 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1651632.0, ans=0.0 2023-06-26 20:55:24,835 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1651632.0, ans=0.125 2023-06-26 20:56:04,635 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.663e+02 5.824e+02 9.070e+02 1.319e+03 2.505e+03, threshold=1.814e+03, percent-clipped=4.0 2023-06-26 20:56:23,640 INFO [train.py:996] (0/4) Epoch 10, batch 850, loss[loss=0.2755, simple_loss=0.3224, pruned_loss=0.1143, over 21761.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2929, pruned_loss=0.06862, over 4224042.95 frames. ], batch size: 508, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 20:56:33,800 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1651812.0, ans=0.05 2023-06-26 20:57:18,570 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1651932.0, ans=0.125 2023-06-26 20:57:33,422 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.21 vs. limit=6.0 2023-06-26 20:58:18,442 INFO [train.py:996] (0/4) Epoch 10, batch 900, loss[loss=0.2535, simple_loss=0.3154, pruned_loss=0.09584, over 21778.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.291, pruned_loss=0.06866, over 4239233.60 frames. 
], batch size: 441, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 20:58:55,243 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1652172.0, ans=0.07 2023-06-26 20:59:26,885 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 20:59:42,416 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.674e+02 4.955e+02 6.528e+02 1.022e+03 3.124e+03, threshold=1.306e+03, percent-clipped=4.0 2023-06-26 21:00:07,569 INFO [train.py:996] (0/4) Epoch 10, batch 950, loss[loss=0.1653, simple_loss=0.2577, pruned_loss=0.03648, over 21624.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2892, pruned_loss=0.06908, over 4250372.15 frames. ], batch size: 230, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 21:00:08,410 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1652412.0, ans=0.125 2023-06-26 21:00:17,996 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.45 vs. limit=15.0 2023-06-26 21:00:58,602 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.43 vs. limit=10.0 2023-06-26 21:01:17,226 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1652592.0, ans=0.0 2023-06-26 21:01:56,951 INFO [train.py:996] (0/4) Epoch 10, batch 1000, loss[loss=0.2548, simple_loss=0.3264, pruned_loss=0.09157, over 21539.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2898, pruned_loss=0.06909, over 4262561.90 frames. ], batch size: 507, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 21:02:11,993 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1652712.0, ans=0.125 2023-06-26 21:02:15,721 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1652712.0, ans=0.125 2023-06-26 21:03:22,661 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=1652892.0, ans=22.5 2023-06-26 21:03:26,493 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.46 vs. limit=22.5 2023-06-26 21:03:31,703 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.126e+02 7.237e+02 1.217e+03 1.852e+03 3.276e+03, threshold=2.433e+03, percent-clipped=47.0 2023-06-26 21:03:41,114 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1652952.0, ans=0.125 2023-06-26 21:03:56,364 INFO [train.py:996] (0/4) Epoch 10, batch 1050, loss[loss=0.1777, simple_loss=0.2479, pruned_loss=0.05376, over 16043.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2908, pruned_loss=0.06885, over 4261661.46 frames. ], batch size: 60, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 21:04:04,053 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.76 vs. limit=10.0 2023-06-26 21:05:46,793 INFO [train.py:996] (0/4) Epoch 10, batch 1100, loss[loss=0.1908, simple_loss=0.2556, pruned_loss=0.06297, over 21260.00 frames. 
], tot_loss[loss=0.212, simple_loss=0.2891, pruned_loss=0.06744, over 4269704.04 frames. ], batch size: 608, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 21:05:48,016 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.94 vs. limit=22.5 2023-06-26 21:06:37,872 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.10 vs. limit=15.0 2023-06-26 21:07:07,783 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1653492.0, ans=0.125 2023-06-26 21:07:14,105 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.610e+02 5.858e+02 8.624e+02 1.218e+03 2.996e+03, threshold=1.725e+03, percent-clipped=2.0 2023-06-26 21:07:38,288 INFO [train.py:996] (0/4) Epoch 10, batch 1150, loss[loss=0.2268, simple_loss=0.31, pruned_loss=0.07184, over 21842.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2902, pruned_loss=0.0681, over 4277854.54 frames. ], batch size: 332, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 21:07:54,189 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.19 vs. limit=22.5 2023-06-26 21:08:13,975 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1653672.0, ans=0.125 2023-06-26 21:08:22,958 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1653672.0, ans=0.125 2023-06-26 21:09:36,635 INFO [train.py:996] (0/4) Epoch 10, batch 1200, loss[loss=0.2577, simple_loss=0.3174, pruned_loss=0.09901, over 21587.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2925, pruned_loss=0.0688, over 4286191.80 frames. ], batch size: 471, lr: 3.01e-03, grad_scale: 32.0 2023-06-26 21:09:40,666 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1653912.0, ans=0.1 2023-06-26 21:09:53,162 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1653912.0, ans=0.0 2023-06-26 21:09:57,264 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.25 vs. limit=22.5 2023-06-26 21:10:06,719 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1653972.0, ans=0.2 2023-06-26 21:11:00,198 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.823e+02 5.719e+02 8.661e+02 1.239e+03 3.080e+03, threshold=1.732e+03, percent-clipped=10.0 2023-06-26 21:11:21,440 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1654152.0, ans=0.125 2023-06-26 21:11:23,370 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 21:11:25,981 INFO [train.py:996] (0/4) Epoch 10, batch 1250, loss[loss=0.2119, simple_loss=0.315, pruned_loss=0.05438, over 21686.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2953, pruned_loss=0.06884, over 4290651.51 frames. 
], batch size: 247, lr: 3.01e-03, grad_scale: 32.0 2023-06-26 21:11:34,241 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.53 vs. limit=12.0 2023-06-26 21:12:46,389 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1654392.0, ans=0.125 2023-06-26 21:13:16,671 INFO [train.py:996] (0/4) Epoch 10, batch 1300, loss[loss=0.2991, simple_loss=0.3552, pruned_loss=0.1215, over 21380.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2965, pruned_loss=0.06909, over 4294325.47 frames. ], batch size: 509, lr: 3.01e-03, grad_scale: 32.0 2023-06-26 21:13:35,536 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1654512.0, ans=0.125 2023-06-26 21:14:43,544 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.194e+02 7.398e+02 1.015e+03 1.513e+03 3.841e+03, threshold=2.029e+03, percent-clipped=13.0 2023-06-26 21:15:06,107 INFO [train.py:996] (0/4) Epoch 10, batch 1350, loss[loss=0.2136, simple_loss=0.285, pruned_loss=0.07106, over 21864.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2952, pruned_loss=0.06922, over 4288766.87 frames. ], batch size: 371, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 21:15:29,778 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1654872.0, ans=0.125 2023-06-26 21:16:25,185 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1654992.0, ans=0.0 2023-06-26 21:17:00,106 INFO [train.py:996] (0/4) Epoch 10, batch 1400, loss[loss=0.2103, simple_loss=0.2891, pruned_loss=0.06576, over 21847.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2948, pruned_loss=0.06964, over 4291794.84 frames. ], batch size: 372, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 21:17:51,394 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1655232.0, ans=0.2 2023-06-26 21:17:57,962 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1655292.0, ans=0.125 2023-06-26 21:18:25,055 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.941e+02 5.863e+02 9.944e+02 1.473e+03 3.016e+03, threshold=1.989e+03, percent-clipped=13.0 2023-06-26 21:18:26,175 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1655352.0, ans=0.125 2023-06-26 21:18:44,441 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.60 vs. limit=15.0 2023-06-26 21:18:48,213 INFO [train.py:996] (0/4) Epoch 10, batch 1450, loss[loss=0.2257, simple_loss=0.3001, pruned_loss=0.07562, over 21781.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2969, pruned_loss=0.07049, over 4292872.99 frames. ], batch size: 332, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 21:18:53,830 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1655412.0, ans=0.125 2023-06-26 21:19:46,970 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.53 vs. 
limit=15.0 2023-06-26 21:20:36,879 INFO [train.py:996] (0/4) Epoch 10, batch 1500, loss[loss=0.2594, simple_loss=0.348, pruned_loss=0.0854, over 21685.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2975, pruned_loss=0.07123, over 4297568.56 frames. ], batch size: 441, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 21:21:21,177 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1655832.0, ans=0.0 2023-06-26 21:21:28,868 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.27 vs. limit=6.0 2023-06-26 21:22:03,788 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.679e+02 5.579e+02 7.007e+02 1.027e+03 2.656e+03, threshold=1.401e+03, percent-clipped=4.0 2023-06-26 21:22:04,761 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 21:22:23,051 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-276000.pt 2023-06-26 21:22:29,803 INFO [train.py:996] (0/4) Epoch 10, batch 1550, loss[loss=0.2125, simple_loss=0.306, pruned_loss=0.05953, over 21794.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.295, pruned_loss=0.07066, over 4299818.14 frames. ], batch size: 316, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 21:23:51,250 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1656192.0, ans=0.04949747468305833 2023-06-26 21:24:18,733 INFO [train.py:996] (0/4) Epoch 10, batch 1600, loss[loss=0.2351, simple_loss=0.3117, pruned_loss=0.07925, over 21793.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2943, pruned_loss=0.07056, over 4299446.07 frames. ], batch size: 351, lr: 3.01e-03, grad_scale: 32.0 2023-06-26 21:25:50,452 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.007e+02 6.112e+02 1.058e+03 1.502e+03 3.121e+03, threshold=2.116e+03, percent-clipped=30.0 2023-06-26 21:25:54,579 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1656552.0, ans=0.125 2023-06-26 21:26:07,907 INFO [train.py:996] (0/4) Epoch 10, batch 1650, loss[loss=0.2262, simple_loss=0.3095, pruned_loss=0.07143, over 21764.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2919, pruned_loss=0.0693, over 4289495.61 frames. ], batch size: 332, lr: 3.01e-03, grad_scale: 32.0 2023-06-26 21:26:54,034 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1656732.0, ans=0.1 2023-06-26 21:27:04,752 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1656732.0, ans=0.2 2023-06-26 21:27:37,291 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1656792.0, ans=0.0 2023-06-26 21:28:04,352 INFO [train.py:996] (0/4) Epoch 10, batch 1700, loss[loss=0.2547, simple_loss=0.335, pruned_loss=0.08723, over 21821.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2937, pruned_loss=0.07001, over 4282460.36 frames. 
], batch size: 124, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 21:28:26,693 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1656972.0, ans=0.0 2023-06-26 21:29:23,191 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.93 vs. limit=10.0 2023-06-26 21:29:31,979 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1657152.0, ans=0.125 2023-06-26 21:29:39,863 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.61 vs. limit=15.0 2023-06-26 21:29:40,235 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.858e+02 6.519e+02 9.043e+02 1.348e+03 2.914e+03, threshold=1.809e+03, percent-clipped=3.0 2023-06-26 21:29:55,214 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1657212.0, ans=0.0 2023-06-26 21:29:56,229 INFO [train.py:996] (0/4) Epoch 10, batch 1750, loss[loss=0.1601, simple_loss=0.2251, pruned_loss=0.04759, over 21151.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.294, pruned_loss=0.06867, over 4276657.97 frames. ], batch size: 143, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 21:31:40,872 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.24 vs. limit=15.0 2023-06-26 21:31:41,949 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1657452.0, ans=0.125 2023-06-26 21:31:54,541 INFO [train.py:996] (0/4) Epoch 10, batch 1800, loss[loss=0.2194, simple_loss=0.3165, pruned_loss=0.06112, over 19807.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2934, pruned_loss=0.06688, over 4275330.14 frames. ], batch size: 703, lr: 3.01e-03, grad_scale: 8.0 2023-06-26 21:32:03,865 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 21:32:18,796 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1657572.0, ans=0.0 2023-06-26 21:32:22,342 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1657572.0, ans=0.1 2023-06-26 21:32:39,267 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=6.44 vs. limit=22.5 2023-06-26 21:33:24,924 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.986e+02 5.658e+02 9.190e+02 1.767e+03 4.020e+03, threshold=1.838e+03, percent-clipped=23.0 2023-06-26 21:33:40,398 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.62 vs. limit=15.0 2023-06-26 21:33:44,330 INFO [train.py:996] (0/4) Epoch 10, batch 1850, loss[loss=0.2367, simple_loss=0.3352, pruned_loss=0.06909, over 21622.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2954, pruned_loss=0.0656, over 4271253.40 frames. 
], batch size: 389, lr: 3.01e-03, grad_scale: 8.0 2023-06-26 21:34:37,398 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1657932.0, ans=0.125 2023-06-26 21:35:10,089 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1658052.0, ans=0.125 2023-06-26 21:35:32,235 INFO [train.py:996] (0/4) Epoch 10, batch 1900, loss[loss=0.2043, simple_loss=0.2959, pruned_loss=0.05633, over 21738.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2949, pruned_loss=0.06544, over 4270348.26 frames. ], batch size: 298, lr: 3.01e-03, grad_scale: 8.0 2023-06-26 21:35:38,009 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1658112.0, ans=0.1 2023-06-26 21:35:57,544 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1658172.0, ans=0.2 2023-06-26 21:36:15,806 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1658172.0, ans=0.1 2023-06-26 21:37:01,852 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1658352.0, ans=0.125 2023-06-26 21:37:08,215 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.968e+02 6.601e+02 8.691e+02 1.330e+03 2.480e+03, threshold=1.738e+03, percent-clipped=9.0 2023-06-26 21:37:08,827 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1658352.0, ans=0.2 2023-06-26 21:37:15,707 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1658352.0, ans=0.1 2023-06-26 21:37:22,023 INFO [train.py:996] (0/4) Epoch 10, batch 1950, loss[loss=0.17, simple_loss=0.2482, pruned_loss=0.04585, over 21283.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.2928, pruned_loss=0.06589, over 4270035.53 frames. ], batch size: 131, lr: 3.00e-03, grad_scale: 8.0 2023-06-26 21:38:47,315 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1658652.0, ans=0.0 2023-06-26 21:39:06,736 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1658652.0, ans=0.0 2023-06-26 21:39:11,007 INFO [train.py:996] (0/4) Epoch 10, batch 2000, loss[loss=0.1994, simple_loss=0.302, pruned_loss=0.04834, over 20811.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.2881, pruned_loss=0.0643, over 4268824.48 frames. ], batch size: 607, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 21:40:30,938 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1658892.0, ans=0.2 2023-06-26 21:40:46,012 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.256e+02 7.434e+02 1.051e+03 1.825e+03 4.116e+03, threshold=2.102e+03, percent-clipped=26.0 2023-06-26 21:40:57,600 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1658952.0, ans=0.0 2023-06-26 21:41:00,341 INFO [train.py:996] (0/4) Epoch 10, batch 2050, loss[loss=0.1799, simple_loss=0.2869, pruned_loss=0.03641, over 21141.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2891, pruned_loss=0.06364, over 4270653.29 frames. 
], batch size: 548, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 21:41:36,184 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1659072.0, ans=0.1 2023-06-26 21:41:47,541 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.63 vs. limit=15.0 2023-06-26 21:41:52,576 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.72 vs. limit=22.5 2023-06-26 21:41:56,241 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.51 vs. limit=15.0 2023-06-26 21:42:08,801 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1659192.0, ans=0.125 2023-06-26 21:42:12,065 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1659192.0, ans=0.125 2023-06-26 21:42:26,237 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1659192.0, ans=0.125 2023-06-26 21:42:52,078 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1659312.0, ans=0.1 2023-06-26 21:42:53,064 INFO [train.py:996] (0/4) Epoch 10, batch 2100, loss[loss=0.2349, simple_loss=0.322, pruned_loss=0.07387, over 21572.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2943, pruned_loss=0.0661, over 4271703.42 frames. ], batch size: 230, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 21:42:59,319 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.06 vs. limit=15.0 2023-06-26 21:43:18,173 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1659372.0, ans=0.0 2023-06-26 21:43:19,845 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1659372.0, ans=0.2 2023-06-26 21:43:34,832 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1659372.0, ans=0.95 2023-06-26 21:44:22,018 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.975e+02 6.453e+02 1.021e+03 1.329e+03 2.280e+03, threshold=2.042e+03, percent-clipped=5.0 2023-06-26 21:44:41,194 INFO [train.py:996] (0/4) Epoch 10, batch 2150, loss[loss=0.2073, simple_loss=0.2866, pruned_loss=0.06396, over 21845.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2938, pruned_loss=0.06688, over 4276178.54 frames. 
], batch size: 372, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 21:45:21,474 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1659672.0, ans=0.2 2023-06-26 21:45:41,279 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1659732.0, ans=0.0 2023-06-26 21:46:08,520 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1659852.0, ans=0.125 2023-06-26 21:46:08,532 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1659852.0, ans=0.125 2023-06-26 21:46:29,854 INFO [train.py:996] (0/4) Epoch 10, batch 2200, loss[loss=0.2291, simple_loss=0.292, pruned_loss=0.08311, over 21388.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2941, pruned_loss=0.06748, over 4280713.49 frames. ], batch size: 471, lr: 3.00e-03, grad_scale: 8.0 2023-06-26 21:47:10,113 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1659972.0, ans=0.0 2023-06-26 21:47:13,412 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1659972.0, ans=0.0 2023-06-26 21:47:39,070 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1660092.0, ans=0.125 2023-06-26 21:47:47,668 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1660092.0, ans=0.125 2023-06-26 21:47:50,885 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1660092.0, ans=0.1 2023-06-26 21:48:00,548 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.021e+02 5.718e+02 8.930e+02 1.284e+03 2.710e+03, threshold=1.786e+03, percent-clipped=5.0 2023-06-26 21:48:17,690 INFO [train.py:996] (0/4) Epoch 10, batch 2250, loss[loss=0.1966, simple_loss=0.2605, pruned_loss=0.06628, over 21146.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2923, pruned_loss=0.06624, over 4286312.82 frames. ], batch size: 143, lr: 3.00e-03, grad_scale: 8.0 2023-06-26 21:48:42,209 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.79 vs. limit=15.0 2023-06-26 21:49:17,617 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1660332.0, ans=0.0 2023-06-26 21:49:44,839 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1660452.0, ans=0.0 2023-06-26 21:49:53,701 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1660452.0, ans=0.2 2023-06-26 21:50:04,575 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.47 vs. limit=15.0 2023-06-26 21:50:04,943 INFO [train.py:996] (0/4) Epoch 10, batch 2300, loss[loss=0.2129, simple_loss=0.2977, pruned_loss=0.06406, over 21220.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2891, pruned_loss=0.06583, over 4290085.42 frames. 
], batch size: 176, lr: 3.00e-03, grad_scale: 8.0 2023-06-26 21:50:38,687 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1660572.0, ans=0.125 2023-06-26 21:51:40,845 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.466e+02 6.340e+02 1.061e+03 1.425e+03 3.450e+03, threshold=2.122e+03, percent-clipped=15.0 2023-06-26 21:51:44,880 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1660752.0, ans=0.0 2023-06-26 21:51:48,564 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1660752.0, ans=0.125 2023-06-26 21:51:52,953 INFO [train.py:996] (0/4) Epoch 10, batch 2350, loss[loss=0.204, simple_loss=0.2532, pruned_loss=0.07739, over 20110.00 frames. ], tot_loss[loss=0.2091, simple_loss=0.2861, pruned_loss=0.06605, over 4283751.65 frames. ], batch size: 703, lr: 3.00e-03, grad_scale: 8.0 2023-06-26 21:52:08,142 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1660812.0, ans=0.125 2023-06-26 21:52:25,989 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1660872.0, ans=0.2 2023-06-26 21:52:29,504 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1660872.0, ans=0.1 2023-06-26 21:52:33,072 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1660872.0, ans=0.125 2023-06-26 21:53:35,346 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.26 vs. limit=12.0 2023-06-26 21:53:45,718 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1661112.0, ans=0.1 2023-06-26 21:53:46,716 INFO [train.py:996] (0/4) Epoch 10, batch 2400, loss[loss=0.2372, simple_loss=0.3085, pruned_loss=0.08294, over 21786.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.288, pruned_loss=0.06769, over 4273793.18 frames. ], batch size: 247, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 21:53:52,835 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1661112.0, ans=0.0 2023-06-26 21:54:45,376 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1661232.0, ans=0.125 2023-06-26 21:54:45,905 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.61 vs. limit=15.0 2023-06-26 21:55:01,181 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1661292.0, ans=0.125 2023-06-26 21:55:17,530 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.139e+02 8.857e+02 1.254e+03 1.714e+03 3.828e+03, threshold=2.507e+03, percent-clipped=13.0 2023-06-26 21:55:35,123 INFO [train.py:996] (0/4) Epoch 10, batch 2450, loss[loss=0.2347, simple_loss=0.3218, pruned_loss=0.07375, over 21354.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2931, pruned_loss=0.06976, over 4272242.93 frames. 
], batch size: 548, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 21:55:50,073 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1661412.0, ans=0.125 2023-06-26 21:55:51,985 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1661412.0, ans=0.125 2023-06-26 21:56:16,639 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.41 vs. limit=15.0 2023-06-26 21:56:58,334 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.37 vs. limit=15.0 2023-06-26 21:57:22,910 INFO [train.py:996] (0/4) Epoch 10, batch 2500, loss[loss=0.2307, simple_loss=0.3199, pruned_loss=0.07071, over 19843.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.291, pruned_loss=0.06915, over 4256605.46 frames. ], batch size: 702, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 21:57:25,748 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.23 vs. limit=15.0 2023-06-26 21:57:36,542 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1661712.0, ans=0.125 2023-06-26 21:57:58,174 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1661772.0, ans=0.2 2023-06-26 21:58:19,681 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.66 vs. limit=6.0 2023-06-26 21:58:52,894 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.949e+02 5.442e+02 7.727e+02 1.360e+03 2.872e+03, threshold=1.545e+03, percent-clipped=3.0 2023-06-26 21:59:16,985 INFO [train.py:996] (0/4) Epoch 10, batch 2550, loss[loss=0.2295, simple_loss=0.3112, pruned_loss=0.07386, over 21667.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2909, pruned_loss=0.06828, over 4256369.78 frames. ], batch size: 298, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:00:12,916 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1662192.0, ans=0.125 2023-06-26 22:00:19,663 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1662192.0, ans=0.0 2023-06-26 22:00:58,675 INFO [train.py:996] (0/4) Epoch 10, batch 2600, loss[loss=0.2376, simple_loss=0.3256, pruned_loss=0.07479, over 17285.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2905, pruned_loss=0.06903, over 4256567.44 frames. ], batch size: 60, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:01:22,368 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1662372.0, ans=0.125 2023-06-26 22:01:24,337 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1662372.0, ans=0.2 2023-06-26 22:01:24,938 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.05 vs. 
limit=15.0 2023-06-26 22:02:05,851 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1662492.0, ans=0.0 2023-06-26 22:02:30,508 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.653e+02 5.932e+02 7.910e+02 1.183e+03 2.273e+03, threshold=1.582e+03, percent-clipped=10.0 2023-06-26 22:02:48,663 INFO [train.py:996] (0/4) Epoch 10, batch 2650, loss[loss=0.2458, simple_loss=0.3112, pruned_loss=0.09017, over 21360.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2921, pruned_loss=0.07081, over 4266027.54 frames. ], batch size: 549, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:04:43,342 INFO [train.py:996] (0/4) Epoch 10, batch 2700, loss[loss=0.1786, simple_loss=0.2468, pruned_loss=0.05522, over 21422.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2922, pruned_loss=0.07086, over 4274529.82 frames. ], batch size: 194, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:04:45,539 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1662912.0, ans=0.0 2023-06-26 22:05:29,613 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1663032.0, ans=0.0 2023-06-26 22:05:29,651 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1663032.0, ans=0.0 2023-06-26 22:05:59,598 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 22:06:09,115 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.093e+02 5.804e+02 8.533e+02 1.371e+03 2.390e+03, threshold=1.707e+03, percent-clipped=16.0 2023-06-26 22:06:31,063 INFO [train.py:996] (0/4) Epoch 10, batch 2750, loss[loss=0.2216, simple_loss=0.3056, pruned_loss=0.06879, over 21834.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.292, pruned_loss=0.07083, over 4273739.28 frames. ], batch size: 118, lr: 3.00e-03, grad_scale: 8.0 2023-06-26 22:06:40,806 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1663212.0, ans=0.035 2023-06-26 22:06:50,437 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn2.whiten.whitening_limit, batch_count=1663272.0, ans=22.5 2023-06-26 22:07:08,626 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 22:07:14,533 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.80 vs. limit=15.0 2023-06-26 22:07:27,878 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1663392.0, ans=0.0 2023-06-26 22:08:21,207 INFO [train.py:996] (0/4) Epoch 10, batch 2800, loss[loss=0.1974, simple_loss=0.313, pruned_loss=0.04088, over 19700.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.295, pruned_loss=0.07123, over 4278234.68 frames. ], batch size: 702, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:09:17,843 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.29 vs. 
limit=15.0 2023-06-26 22:09:25,901 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1663692.0, ans=0.2 2023-06-26 22:10:00,638 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.922e+02 7.435e+02 1.264e+03 2.282e+03 6.620e+03, threshold=2.529e+03, percent-clipped=31.0 2023-06-26 22:10:11,258 INFO [train.py:996] (0/4) Epoch 10, batch 2850, loss[loss=0.2288, simple_loss=0.2872, pruned_loss=0.08517, over 21553.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.2966, pruned_loss=0.07238, over 4280736.39 frames. ], batch size: 548, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:10:29,295 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 22:10:43,060 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1663872.0, ans=0.0 2023-06-26 22:10:48,698 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.97 vs. limit=6.0 2023-06-26 22:11:01,976 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1663932.0, ans=0.1 2023-06-26 22:11:13,083 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.13 vs. limit=6.0 2023-06-26 22:11:59,710 INFO [train.py:996] (0/4) Epoch 10, batch 2900, loss[loss=0.2015, simple_loss=0.2712, pruned_loss=0.06593, over 20096.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2943, pruned_loss=0.07131, over 4284778.43 frames. ], batch size: 702, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:12:11,201 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1664112.0, ans=0.1 2023-06-26 22:13:16,208 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1664292.0, ans=0.2 2023-06-26 22:13:34,173 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.64 vs. limit=15.0 2023-06-26 22:13:38,081 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.904e+02 5.286e+02 7.202e+02 1.145e+03 2.929e+03, threshold=1.440e+03, percent-clipped=1.0 2023-06-26 22:13:46,799 INFO [train.py:996] (0/4) Epoch 10, batch 2950, loss[loss=0.2187, simple_loss=0.299, pruned_loss=0.06926, over 21846.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2947, pruned_loss=0.07064, over 4282238.68 frames. 
], batch size: 371, lr: 3.00e-03, grad_scale: 8.0 2023-06-26 22:14:04,410 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1664412.0, ans=0.125 2023-06-26 22:14:09,755 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1664472.0, ans=0.0 2023-06-26 22:14:45,644 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1664532.0, ans=0.07 2023-06-26 22:15:00,717 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1664592.0, ans=0.125 2023-06-26 22:15:29,607 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1664652.0, ans=0.0 2023-06-26 22:15:40,775 INFO [train.py:996] (0/4) Epoch 10, batch 3000, loss[loss=0.2179, simple_loss=0.3131, pruned_loss=0.06131, over 19761.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2977, pruned_loss=0.07085, over 4284770.20 frames. ], batch size: 704, lr: 3.00e-03, grad_scale: 8.0 2023-06-26 22:15:40,776 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-26 22:15:55,415 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.3.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([2.4590, 3.1598, 3.4223, 3.5321, 2.9981, 2.9333, 3.5978, 3.4797], device='cuda:0') 2023-06-26 22:15:58,652 INFO [train.py:1028] (0/4) Epoch 10, validation: loss=0.2517, simple_loss=0.3411, pruned_loss=0.08118, over 1796401.00 frames. 2023-06-26 22:15:58,653 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23714MB 2023-06-26 22:16:08,195 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1664712.0, ans=0.125 2023-06-26 22:16:42,866 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1664772.0, ans=0.2 2023-06-26 22:17:39,710 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.020e+02 5.823e+02 1.007e+03 1.425e+03 2.943e+03, threshold=2.014e+03, percent-clipped=25.0 2023-06-26 22:17:48,241 INFO [train.py:996] (0/4) Epoch 10, batch 3050, loss[loss=0.1989, simple_loss=0.2771, pruned_loss=0.06038, over 21882.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2994, pruned_loss=0.07001, over 4281856.07 frames. ], batch size: 316, lr: 3.00e-03, grad_scale: 8.0 2023-06-26 22:17:58,261 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.05 vs. limit=10.0 2023-06-26 22:18:05,370 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.20 vs. limit=15.0 2023-06-26 22:18:18,966 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.34 vs. 
limit=15.0 2023-06-26 22:19:00,134 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1665192.0, ans=0.0 2023-06-26 22:19:00,217 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1665192.0, ans=0.0 2023-06-26 22:19:15,815 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1665192.0, ans=0.125 2023-06-26 22:19:31,532 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1665252.0, ans=0.125 2023-06-26 22:19:33,323 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1665252.0, ans=0.0 2023-06-26 22:19:37,777 INFO [train.py:996] (0/4) Epoch 10, batch 3100, loss[loss=0.2504, simple_loss=0.3196, pruned_loss=0.0906, over 21760.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2998, pruned_loss=0.06952, over 4282545.88 frames. ], batch size: 441, lr: 3.00e-03, grad_scale: 8.0 2023-06-26 22:20:43,380 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.42 vs. limit=22.5 2023-06-26 22:21:17,182 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.635e+02 5.384e+02 7.508e+02 1.175e+03 3.644e+03, threshold=1.502e+03, percent-clipped=4.0 2023-06-26 22:21:26,439 INFO [train.py:996] (0/4) Epoch 10, batch 3150, loss[loss=0.172, simple_loss=0.2614, pruned_loss=0.04128, over 21656.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2997, pruned_loss=0.06908, over 4278683.13 frames. ], batch size: 263, lr: 3.00e-03, grad_scale: 8.0 2023-06-26 22:21:56,714 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1665672.0, ans=0.1 2023-06-26 22:22:14,945 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff2.min_abs, batch_count=1665672.0, ans=0.1 2023-06-26 22:23:03,283 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.30 vs. limit=5.0 2023-06-26 22:23:11,619 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.91 vs. limit=6.0 2023-06-26 22:23:22,055 INFO [train.py:996] (0/4) Epoch 10, batch 3200, loss[loss=0.1758, simple_loss=0.2537, pruned_loss=0.04901, over 21299.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.3021, pruned_loss=0.06924, over 4285412.11 frames. 
], batch size: 176, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:23:45,315 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1665972.0, ans=0.125 2023-06-26 22:23:59,569 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1665972.0, ans=0.125 2023-06-26 22:24:13,569 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1666032.0, ans=0.125 2023-06-26 22:24:53,348 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1666152.0, ans=0.0 2023-06-26 22:25:01,080 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.433e+02 6.467e+02 1.041e+03 1.408e+03 2.668e+03, threshold=2.081e+03, percent-clipped=19.0 2023-06-26 22:25:14,948 INFO [train.py:996] (0/4) Epoch 10, batch 3250, loss[loss=0.2434, simple_loss=0.2856, pruned_loss=0.1006, over 21419.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.3032, pruned_loss=0.07057, over 4281824.69 frames. ], batch size: 510, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:25:22,344 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1666212.0, ans=0.2 2023-06-26 22:25:46,452 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1666272.0, ans=0.125 2023-06-26 22:26:16,347 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1666392.0, ans=0.025 2023-06-26 22:27:04,020 INFO [train.py:996] (0/4) Epoch 10, batch 3300, loss[loss=0.2565, simple_loss=0.3553, pruned_loss=0.07885, over 21609.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2987, pruned_loss=0.06977, over 4285101.86 frames. ], batch size: 441, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:28:42,875 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.261e+02 7.426e+02 1.088e+03 1.707e+03 4.708e+03, threshold=2.176e+03, percent-clipped=17.0 2023-06-26 22:28:51,839 INFO [train.py:996] (0/4) Epoch 10, batch 3350, loss[loss=0.2261, simple_loss=0.3061, pruned_loss=0.07302, over 21485.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2995, pruned_loss=0.06982, over 4280134.97 frames. ], batch size: 131, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:29:05,456 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1666812.0, ans=0.125 2023-06-26 22:29:07,290 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=1666812.0, ans=0.025 2023-06-26 22:30:31,748 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1667052.0, ans=0.05 2023-06-26 22:30:39,085 INFO [train.py:996] (0/4) Epoch 10, batch 3400, loss[loss=0.197, simple_loss=0.2665, pruned_loss=0.06372, over 21757.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.3011, pruned_loss=0.0702, over 4281902.83 frames. 
], batch size: 112, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:31:48,492 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1667292.0, ans=0.1 2023-06-26 22:32:20,110 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.256e+02 6.513e+02 9.750e+02 1.536e+03 3.496e+03, threshold=1.950e+03, percent-clipped=9.0 2023-06-26 22:32:34,455 INFO [train.py:996] (0/4) Epoch 10, batch 3450, loss[loss=0.2025, simple_loss=0.2672, pruned_loss=0.06884, over 21484.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.297, pruned_loss=0.06988, over 4274403.53 frames. ], batch size: 441, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:33:45,832 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1667592.0, ans=0.2 2023-06-26 22:34:21,913 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.76 vs. limit=15.0 2023-06-26 22:34:24,146 INFO [train.py:996] (0/4) Epoch 10, batch 3500, loss[loss=0.2304, simple_loss=0.3175, pruned_loss=0.07162, over 21421.00 frames. ], tot_loss[loss=0.225, simple_loss=0.3043, pruned_loss=0.0729, over 4273817.99 frames. ], batch size: 131, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:34:39,394 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.31 vs. limit=22.5 2023-06-26 22:34:51,115 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1667772.0, ans=0.2 2023-06-26 22:35:05,510 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.82 vs. limit=15.0 2023-06-26 22:35:18,151 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.29 vs. limit=22.5 2023-06-26 22:35:43,970 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1667892.0, ans=0.125 2023-06-26 22:35:44,774 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.86 vs. limit=5.0 2023-06-26 22:35:58,264 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1667952.0, ans=0.1 2023-06-26 22:36:04,387 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.012e+02 7.162e+02 1.009e+03 1.814e+03 3.226e+03, threshold=2.018e+03, percent-clipped=21.0 2023-06-26 22:36:13,110 INFO [train.py:996] (0/4) Epoch 10, batch 3550, loss[loss=0.2272, simple_loss=0.2713, pruned_loss=0.09153, over 21312.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3064, pruned_loss=0.07511, over 4279066.49 frames. ], batch size: 507, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:36:37,174 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1668072.0, ans=0.0 2023-06-26 22:37:24,938 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.66 vs. 
limit=15.0 2023-06-26 22:37:29,643 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1668192.0, ans=0.0 2023-06-26 22:37:52,489 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1668252.0, ans=0.125 2023-06-26 22:38:06,103 INFO [train.py:996] (0/4) Epoch 10, batch 3600, loss[loss=0.257, simple_loss=0.334, pruned_loss=0.08996, over 21804.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.3007, pruned_loss=0.07416, over 4283250.57 frames. ], batch size: 124, lr: 3.00e-03, grad_scale: 32.0 2023-06-26 22:38:17,590 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1668312.0, ans=0.0 2023-06-26 22:39:04,945 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1668432.0, ans=0.125 2023-06-26 22:39:42,555 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.944e+02 5.183e+02 6.801e+02 1.024e+03 2.371e+03, threshold=1.360e+03, percent-clipped=4.0 2023-06-26 22:39:50,311 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1668552.0, ans=0.125 2023-06-26 22:39:54,939 INFO [train.py:996] (0/4) Epoch 10, batch 3650, loss[loss=0.2439, simple_loss=0.321, pruned_loss=0.08346, over 21293.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3023, pruned_loss=0.07513, over 4286278.55 frames. ], batch size: 159, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:40:04,380 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1668612.0, ans=0.0 2023-06-26 22:40:36,615 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1668732.0, ans=0.2 2023-06-26 22:40:59,601 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1668792.0, ans=0.025 2023-06-26 22:41:08,664 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.69 vs. limit=8.0 2023-06-26 22:41:28,558 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1668852.0, ans=0.125 2023-06-26 22:41:40,255 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1668912.0, ans=0.125 2023-06-26 22:41:41,217 INFO [train.py:996] (0/4) Epoch 10, batch 3700, loss[loss=0.216, simple_loss=0.3007, pruned_loss=0.0656, over 21272.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3019, pruned_loss=0.0744, over 4287474.51 frames. ], batch size: 159, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:42:17,011 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.88 vs. 
limit=10.0 2023-06-26 22:42:50,365 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1669092.0, ans=0.1 2023-06-26 22:43:23,446 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.784e+02 6.221e+02 8.574e+02 1.297e+03 2.866e+03, threshold=1.715e+03, percent-clipped=21.0 2023-06-26 22:43:30,713 INFO [train.py:996] (0/4) Epoch 10, batch 3750, loss[loss=0.2053, simple_loss=0.2845, pruned_loss=0.06309, over 21834.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3007, pruned_loss=0.074, over 4290248.62 frames. ], batch size: 391, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:43:55,458 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.98 vs. limit=8.0 2023-06-26 22:44:22,513 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.83 vs. limit=15.0 2023-06-26 22:44:48,543 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1669392.0, ans=0.2 2023-06-26 22:44:57,426 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1669392.0, ans=0.2 2023-06-26 22:45:18,864 INFO [train.py:996] (0/4) Epoch 10, batch 3800, loss[loss=0.2426, simple_loss=0.3339, pruned_loss=0.07562, over 21459.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2964, pruned_loss=0.0714, over 4288191.43 frames. ], batch size: 131, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 22:46:58,181 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.737e+02 5.847e+02 8.030e+02 1.160e+03 2.493e+03, threshold=1.606e+03, percent-clipped=8.0 2023-06-26 22:47:10,259 INFO [train.py:996] (0/4) Epoch 10, batch 3850, loss[loss=0.2008, simple_loss=0.2779, pruned_loss=0.06181, over 15156.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2935, pruned_loss=0.07145, over 4284584.93 frames. ], batch size: 60, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 22:47:11,303 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.80 vs. limit=15.0 2023-06-26 22:47:46,680 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1669872.0, ans=0.125 2023-06-26 22:48:32,933 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.84 vs. limit=15.0 2023-06-26 22:48:51,924 INFO [train.py:996] (0/4) Epoch 10, batch 3900, loss[loss=0.1761, simple_loss=0.2296, pruned_loss=0.06134, over 20704.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2892, pruned_loss=0.07083, over 4266273.02 frames. ], batch size: 607, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 22:49:13,575 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1670172.0, ans=0.0 2023-06-26 22:49:13,579 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1670172.0, ans=0.0 2023-06-26 22:50:03,199 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.90 vs. 
limit=15.0 2023-06-26 22:50:19,942 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1670292.0, ans=0.1 2023-06-26 22:50:28,476 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1670352.0, ans=0.125 2023-06-26 22:50:39,678 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.32 vs. limit=10.0 2023-06-26 22:50:40,117 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.341e+02 6.738e+02 9.125e+02 1.558e+03 3.098e+03, threshold=1.825e+03, percent-clipped=22.0 2023-06-26 22:50:47,247 INFO [train.py:996] (0/4) Epoch 10, batch 3950, loss[loss=0.1878, simple_loss=0.2859, pruned_loss=0.04487, over 21728.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.292, pruned_loss=0.07011, over 4270425.72 frames. ], batch size: 351, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 22:51:21,507 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1670472.0, ans=0.125 2023-06-26 22:51:32,158 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1670532.0, ans=0.125 2023-06-26 22:51:54,524 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.30 vs. limit=10.0 2023-06-26 22:52:10,610 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1670592.0, ans=0.125 2023-06-26 22:52:24,802 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1670652.0, ans=0.125 2023-06-26 22:52:35,965 INFO [train.py:996] (0/4) Epoch 10, batch 4000, loss[loss=0.1774, simple_loss=0.2503, pruned_loss=0.05227, over 21712.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2865, pruned_loss=0.06751, over 4271129.46 frames. ], batch size: 316, lr: 2.99e-03, grad_scale: 32.0 2023-06-26 22:53:39,963 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1670832.0, ans=0.95 2023-06-26 22:53:52,688 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.83 vs. limit=12.0 2023-06-26 22:54:19,973 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.363e+02 6.033e+02 8.423e+02 1.568e+03 3.555e+03, threshold=1.685e+03, percent-clipped=19.0 2023-06-26 22:54:31,309 INFO [train.py:996] (0/4) Epoch 10, batch 4050, loss[loss=0.2007, simple_loss=0.2628, pruned_loss=0.06928, over 21494.00 frames. ], tot_loss[loss=0.2085, simple_loss=0.2845, pruned_loss=0.06621, over 4276816.49 frames. ], batch size: 548, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 22:55:26,564 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.66 vs. 
limit=15.0 2023-06-26 22:55:38,222 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1671192.0, ans=0.125 2023-06-26 22:55:45,194 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1671192.0, ans=0.125 2023-06-26 22:55:57,390 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1671252.0, ans=0.0 2023-06-26 22:55:59,469 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1671252.0, ans=0.1 2023-06-26 22:56:00,116 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.82 vs. limit=10.0 2023-06-26 22:56:20,349 INFO [train.py:996] (0/4) Epoch 10, batch 4100, loss[loss=0.1889, simple_loss=0.2758, pruned_loss=0.05104, over 21724.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2887, pruned_loss=0.06655, over 4266345.97 frames. ], batch size: 247, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 22:57:56,498 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 22:57:57,625 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.779e+02 5.678e+02 9.516e+02 1.395e+03 3.425e+03, threshold=1.903e+03, percent-clipped=17.0 2023-06-26 22:58:00,071 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 22:58:02,760 INFO [train.py:996] (0/4) Epoch 10, batch 4150, loss[loss=0.1744, simple_loss=0.2591, pruned_loss=0.04486, over 21801.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.2893, pruned_loss=0.06332, over 4274605.72 frames. ], batch size: 118, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 22:58:10,358 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1671612.0, ans=0.1 2023-06-26 22:58:52,642 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1671732.0, ans=0.0 2023-06-26 22:59:48,066 INFO [train.py:996] (0/4) Epoch 10, batch 4200, loss[loss=0.2323, simple_loss=0.3071, pruned_loss=0.07874, over 21902.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.2894, pruned_loss=0.06393, over 4277972.19 frames. ], batch size: 373, lr: 2.99e-03, grad_scale: 8.0 2023-06-26 22:59:57,873 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1671912.0, ans=0.0 2023-06-26 23:00:15,742 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1671972.0, ans=0.125 2023-06-26 23:00:20,864 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1671972.0, ans=0.0 2023-06-26 23:00:22,698 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1671972.0, ans=0.125 2023-06-26 23:00:22,719 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1671972.0, ans=0.125 2023-06-26 23:00:35,617 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.66 vs. 
limit=10.0 2023-06-26 23:01:14,141 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 23:01:16,334 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1672152.0, ans=0.125 2023-06-26 23:01:29,528 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.633e+02 4.957e+02 6.956e+02 1.176e+03 3.842e+03, threshold=1.391e+03, percent-clipped=7.0 2023-06-26 23:01:33,262 INFO [train.py:996] (0/4) Epoch 10, batch 4250, loss[loss=0.2359, simple_loss=0.3212, pruned_loss=0.07529, over 21273.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2941, pruned_loss=0.06556, over 4272720.02 frames. ], batch size: 131, lr: 2.99e-03, grad_scale: 8.0 2023-06-26 23:02:59,790 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1672452.0, ans=0.125 2023-06-26 23:03:27,722 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=1672452.0, ans=15.0 2023-06-26 23:03:30,181 INFO [train.py:996] (0/4) Epoch 10, batch 4300, loss[loss=0.2085, simple_loss=0.3, pruned_loss=0.05856, over 21665.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.3008, pruned_loss=0.06751, over 4264947.73 frames. ], batch size: 263, lr: 2.99e-03, grad_scale: 8.0 2023-06-26 23:04:04,541 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1672572.0, ans=0.0 2023-06-26 23:04:20,276 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1672632.0, ans=0.2 2023-06-26 23:04:29,399 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1672632.0, ans=0.125 2023-06-26 23:04:52,590 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1672692.0, ans=0.125 2023-06-26 23:05:15,317 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.979e+02 6.207e+02 8.849e+02 1.440e+03 4.327e+03, threshold=1.770e+03, percent-clipped=25.0 2023-06-26 23:05:18,710 INFO [train.py:996] (0/4) Epoch 10, batch 4350, loss[loss=0.1919, simple_loss=0.2707, pruned_loss=0.05659, over 21369.00 frames. ], tot_loss[loss=0.218, simple_loss=0.3005, pruned_loss=0.06776, over 4265141.95 frames. ], batch size: 211, lr: 2.99e-03, grad_scale: 8.0 2023-06-26 23:05:24,711 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1672812.0, ans=0.0 2023-06-26 23:05:59,095 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1672872.0, ans=0.0 2023-06-26 23:06:08,786 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.37 vs. limit=15.0 2023-06-26 23:06:09,471 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1672932.0, ans=0.125 2023-06-26 23:06:11,094 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1672932.0, ans=0.125 2023-06-26 23:07:07,237 INFO [train.py:996] (0/4) Epoch 10, batch 4400, loss[loss=0.1846, simple_loss=0.2609, pruned_loss=0.0542, over 21462.00 frames. 
], tot_loss[loss=0.2151, simple_loss=0.2958, pruned_loss=0.06723, over 4268671.46 frames. ], batch size: 212, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:07:20,852 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.42 vs. limit=15.0 2023-06-26 23:07:42,308 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1673172.0, ans=0.1 2023-06-26 23:08:15,695 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.62 vs. limit=22.5 2023-06-26 23:08:33,943 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1673292.0, ans=0.1 2023-06-26 23:08:34,408 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.82 vs. limit=10.0 2023-06-26 23:08:43,363 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.06 vs. limit=15.0 2023-06-26 23:08:52,418 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.985e+02 5.856e+02 8.779e+02 1.198e+03 2.482e+03, threshold=1.756e+03, percent-clipped=8.0 2023-06-26 23:08:56,199 INFO [train.py:996] (0/4) Epoch 10, batch 4450, loss[loss=0.2406, simple_loss=0.3381, pruned_loss=0.07154, over 21406.00 frames. ], tot_loss[loss=0.22, simple_loss=0.303, pruned_loss=0.06856, over 4275459.56 frames. ], batch size: 211, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:10:03,654 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1673592.0, ans=0.125 2023-06-26 23:10:03,665 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1673592.0, ans=0.2 2023-06-26 23:10:17,688 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1673592.0, ans=0.125 2023-06-26 23:10:34,973 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1673652.0, ans=0.025 2023-06-26 23:10:45,082 INFO [train.py:996] (0/4) Epoch 10, batch 4500, loss[loss=0.2051, simple_loss=0.2902, pruned_loss=0.05999, over 21342.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.3032, pruned_loss=0.07007, over 4278808.63 frames. 
], batch size: 176, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:10:47,393 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1673712.0, ans=0.125 2023-06-26 23:11:16,315 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1673772.0, ans=0.05 2023-06-26 23:11:42,356 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1673832.0, ans=0.125 2023-06-26 23:12:12,326 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1673892.0, ans=0.0 2023-06-26 23:12:17,656 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1673952.0, ans=0.0 2023-06-26 23:12:31,074 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.21 vs. limit=15.0 2023-06-26 23:12:31,383 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.557e+02 6.437e+02 9.027e+02 1.407e+03 3.220e+03, threshold=1.805e+03, percent-clipped=13.0 2023-06-26 23:12:46,676 INFO [train.py:996] (0/4) Epoch 10, batch 4550, loss[loss=0.2316, simple_loss=0.3189, pruned_loss=0.0722, over 21689.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.3059, pruned_loss=0.07064, over 4283262.16 frames. ], batch size: 112, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:12:49,781 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.11 vs. limit=6.0 2023-06-26 23:13:07,230 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.25 vs. limit=15.0 2023-06-26 23:13:08,319 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1674072.0, ans=0.1 2023-06-26 23:13:21,117 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1674072.0, ans=0.0 2023-06-26 23:13:33,513 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1674132.0, ans=0.0 2023-06-26 23:13:49,206 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1674192.0, ans=0.125 2023-06-26 23:13:57,879 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1674192.0, ans=0.125 2023-06-26 23:14:05,216 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.28 vs. limit=15.0 2023-06-26 23:14:34,567 INFO [train.py:996] (0/4) Epoch 10, batch 4600, loss[loss=0.2345, simple_loss=0.3074, pruned_loss=0.0808, over 21761.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3112, pruned_loss=0.07287, over 4285426.75 frames. 
], batch size: 441, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:14:49,178 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1674312.0, ans=0.04949747468305833 2023-06-26 23:14:54,167 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1674372.0, ans=0.125 2023-06-26 23:15:30,796 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1674432.0, ans=0.1 2023-06-26 23:15:37,716 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1674492.0, ans=0.0 2023-06-26 23:15:48,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1674492.0, ans=0.05 2023-06-26 23:16:17,989 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.163e+02 6.181e+02 9.452e+02 1.480e+03 3.323e+03, threshold=1.890e+03, percent-clipped=16.0 2023-06-26 23:16:21,542 INFO [train.py:996] (0/4) Epoch 10, batch 4650, loss[loss=0.201, simple_loss=0.2765, pruned_loss=0.06273, over 21908.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.3061, pruned_loss=0.07154, over 4285771.39 frames. ], batch size: 351, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:16:31,343 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.53 vs. limit=15.0 2023-06-26 23:17:06,663 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1674732.0, ans=0.0 2023-06-26 23:17:39,535 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1674852.0, ans=0.125 2023-06-26 23:18:08,103 INFO [train.py:996] (0/4) Epoch 10, batch 4700, loss[loss=0.1827, simple_loss=0.2565, pruned_loss=0.05443, over 21825.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2969, pruned_loss=0.0693, over 4285955.90 frames. ], batch size: 107, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:18:20,843 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 23:18:41,441 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1674972.0, ans=0.0 2023-06-26 23:18:56,478 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1675032.0, ans=0.125 2023-06-26 23:19:10,379 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.19 vs. limit=6.0 2023-06-26 23:19:20,962 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.16 vs. limit=15.0 2023-06-26 23:19:35,560 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1675152.0, ans=0.125 2023-06-26 23:19:50,878 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.765e+02 4.747e+02 5.523e+02 7.889e+02 1.677e+03, threshold=1.105e+03, percent-clipped=0.0 2023-06-26 23:19:54,061 INFO [train.py:996] (0/4) Epoch 10, batch 4750, loss[loss=0.225, simple_loss=0.2926, pruned_loss=0.07875, over 21745.00 frames. 
], tot_loss[loss=0.2151, simple_loss=0.2911, pruned_loss=0.06952, over 4284093.11 frames. ], batch size: 415, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:20:51,875 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1675332.0, ans=0.1 2023-06-26 23:20:55,431 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=1675392.0, ans=0.05 2023-06-26 23:20:58,888 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1675392.0, ans=0.125 2023-06-26 23:21:00,622 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1675392.0, ans=0.0 2023-06-26 23:21:41,799 INFO [train.py:996] (0/4) Epoch 10, batch 4800, loss[loss=0.1784, simple_loss=0.239, pruned_loss=0.0589, over 21208.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.29, pruned_loss=0.06913, over 4286164.90 frames. ], batch size: 548, lr: 2.99e-03, grad_scale: 32.0 2023-06-26 23:23:21,030 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1675752.0, ans=0.125 2023-06-26 23:23:25,456 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.997e+02 5.704e+02 8.592e+02 1.252e+03 2.093e+03, threshold=1.718e+03, percent-clipped=31.0 2023-06-26 23:23:27,163 INFO [train.py:996] (0/4) Epoch 10, batch 4850, loss[loss=0.2094, simple_loss=0.2891, pruned_loss=0.06487, over 21866.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2883, pruned_loss=0.06848, over 4289793.20 frames. ], batch size: 124, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:23:32,880 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1675812.0, ans=0.1 2023-06-26 23:24:53,547 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1676052.0, ans=0.0 2023-06-26 23:25:10,932 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1676052.0, ans=0.125 2023-06-26 23:25:15,491 INFO [train.py:996] (0/4) Epoch 10, batch 4900, loss[loss=0.2471, simple_loss=0.3816, pruned_loss=0.05631, over 20765.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2907, pruned_loss=0.06906, over 4290502.82 frames. ], batch size: 607, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:25:33,563 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1676112.0, ans=0.125 2023-06-26 23:25:50,980 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 23:26:35,287 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1676292.0, ans=0.125 2023-06-26 23:27:07,352 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.910e+02 6.746e+02 9.232e+02 1.272e+03 2.922e+03, threshold=1.846e+03, percent-clipped=7.0 2023-06-26 23:27:08,934 INFO [train.py:996] (0/4) Epoch 10, batch 4950, loss[loss=0.1766, simple_loss=0.2731, pruned_loss=0.04006, over 21732.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2955, pruned_loss=0.06713, over 4290177.68 frames. 
], batch size: 351, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:27:46,935 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1676532.0, ans=0.125 2023-06-26 23:27:48,370 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1676532.0, ans=0.125 2023-06-26 23:27:48,482 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1676532.0, ans=0.09899494936611666 2023-06-26 23:27:51,168 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.90 vs. limit=22.5 2023-06-26 23:28:23,484 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1676592.0, ans=0.1 2023-06-26 23:28:43,166 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1676652.0, ans=0.1 2023-06-26 23:28:50,826 INFO [train.py:996] (0/4) Epoch 10, batch 5000, loss[loss=0.2142, simple_loss=0.2766, pruned_loss=0.07585, over 20113.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2952, pruned_loss=0.06498, over 4291955.53 frames. ], batch size: 703, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:29:10,901 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1676712.0, ans=0.125 2023-06-26 23:30:35,685 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.540e+02 5.923e+02 8.910e+02 1.386e+03 2.915e+03, threshold=1.782e+03, percent-clipped=9.0 2023-06-26 23:30:37,444 INFO [train.py:996] (0/4) Epoch 10, batch 5050, loss[loss=0.2155, simple_loss=0.2918, pruned_loss=0.06956, over 21436.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2944, pruned_loss=0.06593, over 4284574.76 frames. ], batch size: 211, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:31:25,427 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1677132.0, ans=0.125 2023-06-26 23:31:35,914 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.03 vs. limit=22.5 2023-06-26 23:31:38,968 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1677192.0, ans=0.125 2023-06-26 23:31:50,584 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1677192.0, ans=0.125 2023-06-26 23:31:53,824 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1677192.0, ans=0.0 2023-06-26 23:31:55,417 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1677192.0, ans=0.125 2023-06-26 23:32:12,795 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1677252.0, ans=0.125 2023-06-26 23:32:22,445 INFO [train.py:996] (0/4) Epoch 10, batch 5100, loss[loss=0.1719, simple_loss=0.2547, pruned_loss=0.04457, over 21299.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2907, pruned_loss=0.06631, over 4288253.13 frames. 
], batch size: 176, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:32:51,887 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1677372.0, ans=0.125 2023-06-26 23:33:00,377 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1677372.0, ans=0.0 2023-06-26 23:33:57,704 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1677552.0, ans=0.125 2023-06-26 23:34:07,926 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.053e+02 6.342e+02 8.169e+02 1.053e+03 2.713e+03, threshold=1.634e+03, percent-clipped=6.0 2023-06-26 23:34:09,481 INFO [train.py:996] (0/4) Epoch 10, batch 5150, loss[loss=0.2761, simple_loss=0.3936, pruned_loss=0.07925, over 20765.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.291, pruned_loss=0.06728, over 4293570.90 frames. ], batch size: 607, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:34:11,938 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1677612.0, ans=0.125 2023-06-26 23:34:42,328 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1677672.0, ans=0.125 2023-06-26 23:36:03,522 INFO [train.py:996] (0/4) Epoch 10, batch 5200, loss[loss=0.2039, simple_loss=0.3027, pruned_loss=0.05256, over 21612.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2912, pruned_loss=0.06774, over 4282065.58 frames. ], batch size: 230, lr: 2.99e-03, grad_scale: 32.0 2023-06-26 23:36:25,548 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1677972.0, ans=0.2 2023-06-26 23:36:35,501 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1677972.0, ans=0.035 2023-06-26 23:36:46,164 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1678032.0, ans=0.0 2023-06-26 23:37:50,418 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.940e+02 5.817e+02 8.011e+02 1.324e+03 3.418e+03, threshold=1.602e+03, percent-clipped=14.0 2023-06-26 23:37:50,456 INFO [train.py:996] (0/4) Epoch 10, batch 5250, loss[loss=0.1979, simple_loss=0.2835, pruned_loss=0.05617, over 21560.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2956, pruned_loss=0.06669, over 4280118.06 frames. ], batch size: 230, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:38:12,833 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1678272.0, ans=0.1 2023-06-26 23:38:18,214 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.49 vs. limit=15.0 2023-06-26 23:38:22,767 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1678272.0, ans=0.125 2023-06-26 23:38:34,713 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1678332.0, ans=0.2 2023-06-26 23:38:44,081 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.61 vs. 
limit=22.5 2023-06-26 23:39:19,044 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1678452.0, ans=0.1 2023-06-26 23:39:35,325 INFO [train.py:996] (0/4) Epoch 10, batch 5300, loss[loss=0.2036, simple_loss=0.2814, pruned_loss=0.06285, over 21461.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2949, pruned_loss=0.06719, over 4289437.41 frames. ], batch size: 194, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:39:43,614 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.08 vs. limit=10.0 2023-06-26 23:39:44,907 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.77 vs. limit=22.5 2023-06-26 23:39:51,628 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1678512.0, ans=0.035 2023-06-26 23:39:53,409 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1678512.0, ans=0.2 2023-06-26 23:40:10,763 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.58 vs. limit=15.0 2023-06-26 23:40:41,217 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=1678692.0, ans=0.05 2023-06-26 23:41:04,881 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1678752.0, ans=0.0 2023-06-26 23:41:16,716 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1678752.0, ans=0.125 2023-06-26 23:41:21,208 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.821e+02 5.421e+02 7.005e+02 9.056e+02 1.380e+03, threshold=1.401e+03, percent-clipped=0.0 2023-06-26 23:41:21,241 INFO [train.py:996] (0/4) Epoch 10, batch 5350, loss[loss=0.1992, simple_loss=0.2661, pruned_loss=0.06619, over 21912.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2925, pruned_loss=0.06808, over 4294485.47 frames. ], batch size: 316, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:41:21,907 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff3.min_abs, batch_count=1678812.0, ans=0.2 2023-06-26 23:42:36,223 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1678992.0, ans=0.0 2023-06-26 23:42:58,144 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1679052.0, ans=0.0 2023-06-26 23:42:58,805 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.57 vs. limit=15.0 2023-06-26 23:43:05,922 INFO [train.py:996] (0/4) Epoch 10, batch 5400, loss[loss=0.178, simple_loss=0.2268, pruned_loss=0.06461, over 20753.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2908, pruned_loss=0.0689, over 4294845.62 frames. ], batch size: 608, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:44:20,787 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.70 vs. 
limit=15.0 2023-06-26 23:44:21,970 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1679292.0, ans=0.1 2023-06-26 23:44:33,904 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.17 vs. limit=15.0 2023-06-26 23:44:45,756 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1679352.0, ans=0.1 2023-06-26 23:44:49,370 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1679352.0, ans=0.1 2023-06-26 23:44:53,969 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.666e+02 6.862e+02 1.175e+03 1.926e+03 4.033e+03, threshold=2.351e+03, percent-clipped=41.0 2023-06-26 23:44:54,012 INFO [train.py:996] (0/4) Epoch 10, batch 5450, loss[loss=0.2264, simple_loss=0.3299, pruned_loss=0.06148, over 21790.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2925, pruned_loss=0.0678, over 4295455.55 frames. ], batch size: 298, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:45:15,613 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1679472.0, ans=0.125 2023-06-26 23:45:18,005 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.06 vs. limit=8.0 2023-06-26 23:45:49,047 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1679532.0, ans=0.0 2023-06-26 23:45:50,809 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1679532.0, ans=0.2 2023-06-26 23:46:24,223 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 23:46:50,776 INFO [train.py:996] (0/4) Epoch 10, batch 5500, loss[loss=0.1678, simple_loss=0.2736, pruned_loss=0.03101, over 21683.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2978, pruned_loss=0.06515, over 4290375.84 frames. ], batch size: 298, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:46:57,349 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.20 vs. limit=15.0 2023-06-26 23:48:07,623 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1679892.0, ans=0.125 2023-06-26 23:48:36,319 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-280000.pt 2023-06-26 23:48:48,469 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.727e+02 5.357e+02 7.450e+02 1.317e+03 3.051e+03, threshold=1.490e+03, percent-clipped=6.0 2023-06-26 23:48:48,507 INFO [train.py:996] (0/4) Epoch 10, batch 5550, loss[loss=0.1621, simple_loss=0.2373, pruned_loss=0.04346, over 21080.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2967, pruned_loss=0.06267, over 4291848.03 frames. ], batch size: 143, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:49:26,522 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.97 vs. 
limit=22.5 2023-06-26 23:50:01,941 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1680192.0, ans=0.125 2023-06-26 23:50:38,667 INFO [train.py:996] (0/4) Epoch 10, batch 5600, loss[loss=0.2333, simple_loss=0.3309, pruned_loss=0.06784, over 21823.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.2959, pruned_loss=0.06047, over 4284178.70 frames. ], batch size: 316, lr: 2.99e-03, grad_scale: 32.0 2023-06-26 23:51:10,476 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1680372.0, ans=0.125 2023-06-26 23:51:19,307 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1680432.0, ans=0.2 2023-06-26 23:51:36,579 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.46 vs. limit=22.5 2023-06-26 23:51:52,147 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.05 vs. limit=15.0 2023-06-26 23:52:25,091 INFO [train.py:996] (0/4) Epoch 10, batch 5650, loss[loss=0.2382, simple_loss=0.3109, pruned_loss=0.08274, over 21720.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.3012, pruned_loss=0.06307, over 4288909.92 frames. ], batch size: 441, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:52:27,143 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.741e+02 5.468e+02 7.224e+02 1.167e+03 2.877e+03, threshold=1.445e+03, percent-clipped=12.0 2023-06-26 23:52:56,351 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn1.whiten.whitening_limit, batch_count=1680672.0, ans=22.5 2023-06-26 23:52:56,351 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.68 vs. limit=22.5 2023-06-26 23:53:14,519 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1680732.0, ans=0.0 2023-06-26 23:53:16,230 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1680732.0, ans=0.1 2023-06-26 23:53:18,397 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.42 vs. limit=15.0 2023-06-26 23:54:03,626 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1680852.0, ans=0.1 2023-06-26 23:54:13,515 INFO [train.py:996] (0/4) Epoch 10, batch 5700, loss[loss=0.2032, simple_loss=0.3059, pruned_loss=0.05025, over 21650.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2998, pruned_loss=0.06443, over 4294834.53 frames. ], batch size: 389, lr: 2.98e-03, grad_scale: 16.0 2023-06-26 23:54:24,202 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.25 vs. limit=15.0 2023-06-26 23:56:09,503 INFO [train.py:996] (0/4) Epoch 10, batch 5750, loss[loss=0.2086, simple_loss=0.3088, pruned_loss=0.0542, over 21186.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2956, pruned_loss=0.06152, over 4292876.40 frames. 
], batch size: 548, lr: 2.98e-03, grad_scale: 16.0 2023-06-26 23:56:11,425 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.677e+02 6.670e+02 9.043e+02 1.357e+03 3.417e+03, threshold=1.809e+03, percent-clipped=19.0 2023-06-26 23:56:45,967 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1681272.0, ans=0.125 2023-06-26 23:56:54,713 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1681332.0, ans=0.2 2023-06-26 23:57:43,168 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1681452.0, ans=0.125 2023-06-26 23:57:45,127 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 23:57:58,041 INFO [train.py:996] (0/4) Epoch 10, batch 5800, loss[loss=0.21, simple_loss=0.3095, pruned_loss=0.05519, over 21749.00 frames. ], tot_loss[loss=0.2074, simple_loss=0.2945, pruned_loss=0.06018, over 4289872.70 frames. ], batch size: 298, lr: 2.98e-03, grad_scale: 16.0 2023-06-26 23:59:01,590 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1681632.0, ans=0.0 2023-06-26 23:59:03,135 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1681632.0, ans=0.125 2023-06-26 23:59:21,023 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1681692.0, ans=0.07 2023-06-26 23:59:46,320 INFO [train.py:996] (0/4) Epoch 10, batch 5850, loss[loss=0.2192, simple_loss=0.3115, pruned_loss=0.0635, over 21479.00 frames. ], tot_loss[loss=0.2036, simple_loss=0.2931, pruned_loss=0.05703, over 4286523.09 frames. ], batch size: 507, lr: 2.98e-03, grad_scale: 16.0 2023-06-26 23:59:53,509 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.721e+02 4.995e+02 7.881e+02 1.168e+03 2.434e+03, threshold=1.576e+03, percent-clipped=1.0 2023-06-27 00:00:29,369 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.38 vs. limit=15.0 2023-06-27 00:01:16,498 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1682052.0, ans=0.125 2023-06-27 00:01:37,813 INFO [train.py:996] (0/4) Epoch 10, batch 5900, loss[loss=0.161, simple_loss=0.2351, pruned_loss=0.04344, over 16525.00 frames. ], tot_loss[loss=0.1963, simple_loss=0.2867, pruned_loss=0.05296, over 4278645.63 frames. ], batch size: 61, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:01:45,566 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1682112.0, ans=0.1 2023-06-27 00:02:41,717 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1682232.0, ans=0.125 2023-06-27 00:03:02,090 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1682352.0, ans=0.1 2023-06-27 00:03:21,411 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1682352.0, ans=0.125 2023-06-27 00:03:24,111 INFO [train.py:996] (0/4) Epoch 10, batch 5950, loss[loss=0.2175, simple_loss=0.2857, pruned_loss=0.07459, over 21869.00 frames. 
], tot_loss[loss=0.1972, simple_loss=0.284, pruned_loss=0.05525, over 4282675.26 frames. ], batch size: 414, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:03:25,859 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.299e+02 4.862e+02 7.145e+02 9.461e+02 2.592e+03, threshold=1.429e+03, percent-clipped=2.0 2023-06-27 00:04:22,723 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1682532.0, ans=0.1 2023-06-27 00:04:37,805 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1682592.0, ans=0.125 2023-06-27 00:04:45,771 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1682652.0, ans=0.035 2023-06-27 00:04:49,020 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1682652.0, ans=0.125 2023-06-27 00:05:08,668 INFO [train.py:996] (0/4) Epoch 10, batch 6000, loss[loss=0.1693, simple_loss=0.2282, pruned_loss=0.05523, over 21241.00 frames. ], tot_loss[loss=0.1979, simple_loss=0.2796, pruned_loss=0.05808, over 4280893.64 frames. ], batch size: 548, lr: 2.98e-03, grad_scale: 32.0 2023-06-27 00:05:08,669 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-27 00:05:29,815 INFO [train.py:1028] (0/4) Epoch 10, validation: loss=0.2604, simple_loss=0.3533, pruned_loss=0.08374, over 1796401.00 frames. 2023-06-27 00:05:29,816 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23714MB 2023-06-27 00:06:13,497 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.96 vs. limit=12.0 2023-06-27 00:06:23,491 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1682832.0, ans=0.125 2023-06-27 00:07:18,980 INFO [train.py:996] (0/4) Epoch 10, batch 6050, loss[loss=0.2108, simple_loss=0.2762, pruned_loss=0.07274, over 15337.00 frames. ], tot_loss[loss=0.1973, simple_loss=0.2752, pruned_loss=0.05974, over 4261431.43 frames. ], batch size: 60, lr: 2.98e-03, grad_scale: 8.0 2023-06-27 00:07:24,243 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.965e+02 5.435e+02 7.983e+02 1.281e+03 2.662e+03, threshold=1.597e+03, percent-clipped=18.0 2023-06-27 00:07:37,301 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1683072.0, ans=0.0 2023-06-27 00:07:57,815 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1683132.0, ans=0.0 2023-06-27 00:08:41,123 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1683252.0, ans=0.1 2023-06-27 00:08:43,530 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.50 vs. limit=15.0 2023-06-27 00:09:06,592 INFO [train.py:996] (0/4) Epoch 10, batch 6100, loss[loss=0.2012, simple_loss=0.2744, pruned_loss=0.06403, over 21788.00 frames. ], tot_loss[loss=0.1967, simple_loss=0.2758, pruned_loss=0.05876, over 4261672.80 frames. 
], batch size: 247, lr: 2.98e-03, grad_scale: 8.0 2023-06-27 00:09:12,793 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.44 vs. limit=15.0 2023-06-27 00:10:42,284 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.85 vs. limit=15.0 2023-06-27 00:10:53,288 INFO [train.py:996] (0/4) Epoch 10, batch 6150, loss[loss=0.2329, simple_loss=0.3073, pruned_loss=0.07926, over 21513.00 frames. ], tot_loss[loss=0.2, simple_loss=0.2788, pruned_loss=0.06062, over 4263160.85 frames. ], batch size: 473, lr: 2.98e-03, grad_scale: 8.0 2023-06-27 00:10:58,724 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.616e+02 5.589e+02 9.647e+02 1.302e+03 3.090e+03, threshold=1.929e+03, percent-clipped=16.0 2023-06-27 00:10:59,843 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.35 vs. limit=6.0 2023-06-27 00:11:33,985 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1683732.0, ans=0.0 2023-06-27 00:11:48,007 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1683732.0, ans=0.125 2023-06-27 00:11:48,076 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1683732.0, ans=0.2 2023-06-27 00:11:48,746 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.54 vs. limit=15.0 2023-06-27 00:11:56,867 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1683792.0, ans=0.2 2023-06-27 00:12:32,267 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1683852.0, ans=0.0 2023-06-27 00:12:42,252 INFO [train.py:996] (0/4) Epoch 10, batch 6200, loss[loss=0.223, simple_loss=0.2981, pruned_loss=0.07393, over 21451.00 frames. ], tot_loss[loss=0.2033, simple_loss=0.2834, pruned_loss=0.06156, over 4256939.41 frames. ], batch size: 131, lr: 2.98e-03, grad_scale: 8.0 2023-06-27 00:13:46,561 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1684092.0, ans=0.125 2023-06-27 00:14:19,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1684152.0, ans=0.0 2023-06-27 00:14:31,239 INFO [train.py:996] (0/4) Epoch 10, batch 6250, loss[loss=0.1891, simple_loss=0.2889, pruned_loss=0.04469, over 21657.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.29, pruned_loss=0.06247, over 4254504.71 frames. ], batch size: 247, lr: 2.98e-03, grad_scale: 8.0 2023-06-27 00:14:36,260 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.907e+02 5.995e+02 9.540e+02 1.636e+03 4.135e+03, threshold=1.908e+03, percent-clipped=20.0 2023-06-27 00:14:47,147 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1684212.0, ans=0.0 2023-06-27 00:14:51,184 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.10 vs. 
limit=15.0 2023-06-27 00:15:14,179 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1684332.0, ans=0.125 2023-06-27 00:16:03,906 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.15 vs. limit=15.0 2023-06-27 00:16:16,378 INFO [train.py:996] (0/4) Epoch 10, batch 6300, loss[loss=0.2004, simple_loss=0.3132, pruned_loss=0.04379, over 19803.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.293, pruned_loss=0.06133, over 4256027.63 frames. ], batch size: 703, lr: 2.98e-03, grad_scale: 8.0 2023-06-27 00:16:44,700 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1684572.0, ans=0.125 2023-06-27 00:17:09,079 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1684632.0, ans=0.1 2023-06-27 00:18:08,449 INFO [train.py:996] (0/4) Epoch 10, batch 6350, loss[loss=0.2562, simple_loss=0.325, pruned_loss=0.09373, over 21288.00 frames. ], tot_loss[loss=0.211, simple_loss=0.295, pruned_loss=0.06344, over 4261818.39 frames. ], batch size: 143, lr: 2.98e-03, grad_scale: 8.0 2023-06-27 00:18:13,742 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.722e+02 5.276e+02 6.494e+02 9.126e+02 1.517e+03, threshold=1.299e+03, percent-clipped=0.0 2023-06-27 00:18:18,357 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1684812.0, ans=0.05 2023-06-27 00:19:48,229 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=1685052.0, ans=0.05 2023-06-27 00:19:52,401 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.86 vs. limit=15.0 2023-06-27 00:19:57,970 INFO [train.py:996] (0/4) Epoch 10, batch 6400, loss[loss=0.2452, simple_loss=0.3149, pruned_loss=0.08773, over 21520.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2987, pruned_loss=0.06723, over 4263013.12 frames. ], batch size: 194, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:21:51,050 INFO [train.py:996] (0/4) Epoch 10, batch 6450, loss[loss=0.1998, simple_loss=0.2925, pruned_loss=0.05355, over 21771.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.302, pruned_loss=0.06784, over 4267834.90 frames. ], batch size: 316, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:21:55,953 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.237e+02 6.943e+02 1.024e+03 1.521e+03 2.741e+03, threshold=2.048e+03, percent-clipped=32.0 2023-06-27 00:21:58,133 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1685412.0, ans=0.125 2023-06-27 00:23:07,123 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.50 vs. limit=15.0 2023-06-27 00:23:37,648 INFO [train.py:996] (0/4) Epoch 10, batch 6500, loss[loss=0.2071, simple_loss=0.2683, pruned_loss=0.07294, over 21592.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2948, pruned_loss=0.06673, over 4275195.09 frames. 
], batch size: 415, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:23:46,742 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1685712.0, ans=0.125 2023-06-27 00:24:56,434 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1685892.0, ans=0.1 2023-06-27 00:25:08,249 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1685952.0, ans=0.1 2023-06-27 00:25:13,422 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1685952.0, ans=0.125 2023-06-27 00:25:23,187 INFO [train.py:996] (0/4) Epoch 10, batch 6550, loss[loss=0.202, simple_loss=0.2781, pruned_loss=0.06291, over 21411.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.292, pruned_loss=0.06533, over 4276967.61 frames. ], batch size: 211, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:25:28,449 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.027e+02 5.505e+02 8.547e+02 1.330e+03 2.902e+03, threshold=1.709e+03, percent-clipped=6.0 2023-06-27 00:25:47,073 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1686072.0, ans=0.0 2023-06-27 00:25:50,378 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1686072.0, ans=0.125 2023-06-27 00:27:09,106 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1686312.0, ans=0.125 2023-06-27 00:27:10,179 INFO [train.py:996] (0/4) Epoch 10, batch 6600, loss[loss=0.217, simple_loss=0.2657, pruned_loss=0.0842, over 21405.00 frames. ], tot_loss[loss=0.2085, simple_loss=0.2865, pruned_loss=0.06522, over 4270896.39 frames. ], batch size: 508, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:27:39,059 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1686372.0, ans=0.1 2023-06-27 00:27:49,678 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.03 vs. limit=15.0 2023-06-27 00:28:14,697 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1686432.0, ans=0.0 2023-06-27 00:28:26,187 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1686492.0, ans=0.125 2023-06-27 00:28:35,271 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1686552.0, ans=0.04949747468305833 2023-06-27 00:28:48,158 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.50 vs. limit=15.0 2023-06-27 00:28:50,897 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1686552.0, ans=0.125 2023-06-27 00:28:56,661 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.24 vs. limit=15.0 2023-06-27 00:28:57,109 INFO [train.py:996] (0/4) Epoch 10, batch 6650, loss[loss=0.1737, simple_loss=0.2535, pruned_loss=0.04692, over 21654.00 frames. 
], tot_loss[loss=0.2026, simple_loss=0.2796, pruned_loss=0.0628, over 4270241.93 frames. ], batch size: 298, lr: 2.98e-03, grad_scale: 8.0 2023-06-27 00:29:09,396 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.441e+02 5.556e+02 7.751e+02 1.155e+03 2.381e+03, threshold=1.550e+03, percent-clipped=8.0 2023-06-27 00:29:40,903 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1686732.0, ans=0.0 2023-06-27 00:29:43,113 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.12 vs. limit=15.0 2023-06-27 00:29:50,150 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.47 vs. limit=15.0 2023-06-27 00:30:06,703 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1686792.0, ans=0.125 2023-06-27 00:30:12,291 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.70 vs. limit=15.0 2023-06-27 00:30:18,711 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1686792.0, ans=0.125 2023-06-27 00:30:48,139 INFO [train.py:996] (0/4) Epoch 10, batch 6700, loss[loss=0.1778, simple_loss=0.2452, pruned_loss=0.05519, over 21833.00 frames. ], tot_loss[loss=0.2008, simple_loss=0.276, pruned_loss=0.06283, over 4263311.82 frames. ], batch size: 98, lr: 2.98e-03, grad_scale: 8.0 2023-06-27 00:30:57,167 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1686912.0, ans=0.1 2023-06-27 00:31:03,751 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1686972.0, ans=0.0 2023-06-27 00:31:17,289 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1686972.0, ans=0.125 2023-06-27 00:31:41,286 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1687032.0, ans=0.2 2023-06-27 00:32:02,516 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.13 vs. limit=22.5 2023-06-27 00:32:29,105 INFO [train.py:996] (0/4) Epoch 10, batch 6750, loss[loss=0.2263, simple_loss=0.2916, pruned_loss=0.08053, over 21782.00 frames. ], tot_loss[loss=0.2006, simple_loss=0.2739, pruned_loss=0.06365, over 4261943.64 frames. 
], batch size: 332, lr: 2.98e-03, grad_scale: 8.0 2023-06-27 00:32:41,012 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.839e+02 5.646e+02 8.043e+02 1.106e+03 2.898e+03, threshold=1.609e+03, percent-clipped=7.0 2023-06-27 00:32:56,556 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1687272.0, ans=10.0 2023-06-27 00:33:48,766 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1687392.0, ans=0.125 2023-06-27 00:33:50,371 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1687452.0, ans=0.2 2023-06-27 00:34:13,874 INFO [train.py:996] (0/4) Epoch 10, batch 6800, loss[loss=0.1976, simple_loss=0.2663, pruned_loss=0.06447, over 21422.00 frames. ], tot_loss[loss=0.2026, simple_loss=0.2748, pruned_loss=0.06516, over 4259768.62 frames. ], batch size: 194, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:34:27,046 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.20 vs. limit=15.0 2023-06-27 00:34:29,457 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1687512.0, ans=0.125 2023-06-27 00:34:51,951 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1687572.0, ans=0.125 2023-06-27 00:34:53,443 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1687572.0, ans=0.125 2023-06-27 00:35:25,064 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.95 vs. limit=15.0 2023-06-27 00:36:00,657 INFO [train.py:996] (0/4) Epoch 10, batch 6850, loss[loss=0.2274, simple_loss=0.288, pruned_loss=0.08341, over 21734.00 frames. ], tot_loss[loss=0.2026, simple_loss=0.2738, pruned_loss=0.06566, over 4268499.23 frames. ], batch size: 441, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:36:07,585 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.100e+02 5.578e+02 7.964e+02 1.217e+03 2.059e+03, threshold=1.593e+03, percent-clipped=9.0 2023-06-27 00:36:24,013 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1687872.0, ans=0.125 2023-06-27 00:36:29,067 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1687872.0, ans=0.0 2023-06-27 00:37:01,725 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1687932.0, ans=0.125 2023-06-27 00:37:29,261 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1688052.0, ans=0.0 2023-06-27 00:37:47,334 INFO [train.py:996] (0/4) Epoch 10, batch 6900, loss[loss=0.1987, simple_loss=0.2657, pruned_loss=0.0659, over 21276.00 frames. ], tot_loss[loss=0.2039, simple_loss=0.2759, pruned_loss=0.06596, over 4268467.30 frames. 
], batch size: 143, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:38:01,978 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1688112.0, ans=0.1 2023-06-27 00:38:03,769 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1688112.0, ans=0.0 2023-06-27 00:38:03,835 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1688112.0, ans=0.0 2023-06-27 00:38:03,870 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1688112.0, ans=0.1 2023-06-27 00:38:52,070 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1688232.0, ans=0.0 2023-06-27 00:39:33,065 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1688352.0, ans=0.125 2023-06-27 00:39:36,821 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1688352.0, ans=0.1 2023-06-27 00:39:41,220 INFO [train.py:996] (0/4) Epoch 10, batch 6950, loss[loss=0.2334, simple_loss=0.3206, pruned_loss=0.0731, over 21493.00 frames. ], tot_loss[loss=0.2018, simple_loss=0.278, pruned_loss=0.06278, over 4272782.53 frames. ], batch size: 131, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:39:47,980 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.023e+02 6.673e+02 8.913e+02 1.216e+03 2.486e+03, threshold=1.783e+03, percent-clipped=9.0 2023-06-27 00:40:07,926 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1688472.0, ans=0.2 2023-06-27 00:40:18,155 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1688472.0, ans=0.125 2023-06-27 00:40:21,615 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1688472.0, ans=0.125 2023-06-27 00:40:31,732 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1688532.0, ans=0.125 2023-06-27 00:40:45,586 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1688592.0, ans=0.2 2023-06-27 00:41:28,528 INFO [train.py:996] (0/4) Epoch 10, batch 7000, loss[loss=0.2063, simple_loss=0.2669, pruned_loss=0.07287, over 21750.00 frames. ], tot_loss[loss=0.207, simple_loss=0.2821, pruned_loss=0.06597, over 4275503.46 frames. ], batch size: 317, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:41:36,003 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=1688712.0, ans=10.0 2023-06-27 00:41:53,122 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1688772.0, ans=0.125 2023-06-27 00:42:31,250 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1688892.0, ans=10.0 2023-06-27 00:43:15,459 INFO [train.py:996] (0/4) Epoch 10, batch 7050, loss[loss=0.2055, simple_loss=0.298, pruned_loss=0.05649, over 21598.00 frames. ], tot_loss[loss=0.2045, simple_loss=0.28, pruned_loss=0.06448, over 4261976.76 frames. 
], batch size: 389, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:43:27,751 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.782e+02 6.761e+02 1.057e+03 1.502e+03 3.144e+03, threshold=2.115e+03, percent-clipped=16.0 2023-06-27 00:43:58,664 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1689072.0, ans=0.0 2023-06-27 00:44:39,019 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1689192.0, ans=0.0 2023-06-27 00:45:09,745 INFO [train.py:996] (0/4) Epoch 10, batch 7100, loss[loss=0.1601, simple_loss=0.2356, pruned_loss=0.04227, over 21137.00 frames. ], tot_loss[loss=0.2064, simple_loss=0.2837, pruned_loss=0.06462, over 4258830.33 frames. ], batch size: 143, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:46:08,382 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.14 vs. limit=15.0 2023-06-27 00:46:29,319 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.24 vs. limit=22.5 2023-06-27 00:46:41,883 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1689552.0, ans=0.125 2023-06-27 00:47:02,319 INFO [train.py:996] (0/4) Epoch 10, batch 7150, loss[loss=0.2974, simple_loss=0.3547, pruned_loss=0.12, over 21292.00 frames. ], tot_loss[loss=0.2043, simple_loss=0.2817, pruned_loss=0.06345, over 4264367.25 frames. ], batch size: 507, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:47:09,232 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.861e+02 6.064e+02 8.725e+02 1.357e+03 2.823e+03, threshold=1.745e+03, percent-clipped=6.0 2023-06-27 00:47:11,672 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1689612.0, ans=0.125 2023-06-27 00:47:14,890 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1689612.0, ans=0.1 2023-06-27 00:47:25,483 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1689672.0, ans=0.95 2023-06-27 00:47:34,065 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1689672.0, ans=0.125 2023-06-27 00:48:21,115 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.64 vs. limit=10.0 2023-06-27 00:48:43,483 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.62 vs. limit=12.0 2023-06-27 00:48:48,720 INFO [train.py:996] (0/4) Epoch 10, batch 7200, loss[loss=0.2092, simple_loss=0.2774, pruned_loss=0.07044, over 21775.00 frames. ], tot_loss[loss=0.208, simple_loss=0.2845, pruned_loss=0.06576, over 4271235.47 frames. 
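Note on the optim.py:471 entries: they print five grad-norm percentiles (min, 25%, median, 75%, max) over recent batches, a clipping threshold that tracks clipping_scale times the median (e.g. 2.0 * 8.547e+02 ≈ 1.709e+03 in the batch-6550 entry earlier), and the fraction of recent batches that were clipped. A rough sketch of how such statistics can be maintained; the window size, helper names and exact bookkeeping are illustrative, not icefall's ScaledAdam internals:

    from collections import deque
    import numpy as np
    import torch

    recent_norms = deque(maxlen=200)     # window of recent per-batch gradient norms
    recent_clipped = deque(maxlen=200)   # 1.0 if that batch was clipped, else 0.0

    def clip_and_log(parameters, clipping_scale: float = 2.0) -> None:
        params = list(parameters)
        # Measure the total grad norm without clipping (max_norm=inf only measures).
        total_norm = float(torch.nn.utils.clip_grad_norm_(params, max_norm=float("inf")))
        recent_norms.append(total_norm)

        q = np.percentile(recent_norms, [0, 25, 50, 75, 100])
        threshold = clipping_scale * q[2]            # clipping_scale times the median
        was_clipped = total_norm > threshold
        recent_clipped.append(1.0 if was_clipped else 0.0)
        if was_clipped:
            torch.nn.utils.clip_grad_norm_(params, max_norm=threshold)

        pct = 100.0 * sum(recent_clipped) / len(recent_clipped)
        print(f"Clipping_scale={clipping_scale}, grad-norm quartiles "
              f"{q[0]:.3e} {q[1]:.3e} {q[2]:.3e} {q[3]:.3e} {q[4]:.3e}, "
              f"threshold={threshold:.3e}, percent-clipped={pct}")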
], batch size: 124, lr: 2.98e-03, grad_scale: 32.0 2023-06-27 00:49:20,620 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1689972.0, ans=0.125 2023-06-27 00:49:29,620 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.68 vs. limit=6.0 2023-06-27 00:49:55,096 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.38 vs. limit=10.0 2023-06-27 00:50:13,608 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.35 vs. limit=12.0 2023-06-27 00:50:34,351 INFO [train.py:996] (0/4) Epoch 10, batch 7250, loss[loss=0.2036, simple_loss=0.2686, pruned_loss=0.06935, over 21635.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.2799, pruned_loss=0.06559, over 4271172.53 frames. ], batch size: 393, lr: 2.98e-03, grad_scale: 32.0 2023-06-27 00:50:40,769 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.886e+02 6.230e+02 8.378e+02 1.198e+03 2.214e+03, threshold=1.676e+03, percent-clipped=4.0 2023-06-27 00:51:05,617 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1690272.0, ans=0.2 2023-06-27 00:51:55,817 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 00:52:00,724 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1690452.0, ans=0.2 2023-06-27 00:52:07,444 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1690452.0, ans=0.125 2023-06-27 00:52:08,042 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.16 vs. limit=15.0 2023-06-27 00:52:16,000 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1690452.0, ans=0.125 2023-06-27 00:52:18,313 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.66 vs. limit=22.5 2023-06-27 00:52:18,837 INFO [train.py:996] (0/4) Epoch 10, batch 7300, loss[loss=0.229, simple_loss=0.3409, pruned_loss=0.05852, over 19731.00 frames. ], tot_loss[loss=0.2019, simple_loss=0.2747, pruned_loss=0.06451, over 4261491.50 frames. ], batch size: 703, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:52:19,480 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1690512.0, ans=0.0 2023-06-27 00:54:06,793 INFO [train.py:996] (0/4) Epoch 10, batch 7350, loss[loss=0.2175, simple_loss=0.2825, pruned_loss=0.07623, over 21686.00 frames. ], tot_loss[loss=0.2031, simple_loss=0.2743, pruned_loss=0.06591, over 4261641.98 frames. ], batch size: 247, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:54:15,735 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.980e+02 5.910e+02 7.871e+02 1.338e+03 3.655e+03, threshold=1.574e+03, percent-clipped=15.0 2023-06-27 00:54:31,366 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.27 vs. 
limit=6.0 2023-06-27 00:54:40,059 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1690872.0, ans=0.0 2023-06-27 00:55:26,113 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1690992.0, ans=0.125 2023-06-27 00:55:56,541 INFO [train.py:996] (0/4) Epoch 10, batch 7400, loss[loss=0.2379, simple_loss=0.3339, pruned_loss=0.07091, over 21549.00 frames. ], tot_loss[loss=0.2085, simple_loss=0.2805, pruned_loss=0.06825, over 4268339.29 frames. ], batch size: 473, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:55:59,011 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1691112.0, ans=0.125 2023-06-27 00:56:24,882 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1691172.0, ans=0.0 2023-06-27 00:56:26,554 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1691172.0, ans=0.2 2023-06-27 00:56:36,451 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1691172.0, ans=0.0 2023-06-27 00:56:55,118 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1691232.0, ans=0.125 2023-06-27 00:56:55,246 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1691232.0, ans=0.05 2023-06-27 00:57:12,162 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1691292.0, ans=0.2 2023-06-27 00:57:20,594 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1691292.0, ans=0.125 2023-06-27 00:57:41,596 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1691412.0, ans=10.0 2023-06-27 00:57:42,489 INFO [train.py:996] (0/4) Epoch 10, batch 7450, loss[loss=0.1872, simple_loss=0.255, pruned_loss=0.05976, over 21565.00 frames. ], tot_loss[loss=0.2057, simple_loss=0.2787, pruned_loss=0.06639, over 4262264.33 frames. ], batch size: 213, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:57:56,773 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.982e+02 5.896e+02 9.357e+02 1.491e+03 2.777e+03, threshold=1.871e+03, percent-clipped=18.0 2023-06-27 00:58:21,024 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1691472.0, ans=0.0 2023-06-27 00:58:28,039 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1691532.0, ans=0.125 2023-06-27 00:59:03,184 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1691592.0, ans=0.04949747468305833 2023-06-27 00:59:26,195 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1691652.0, ans=0.0 2023-06-27 00:59:37,967 INFO [train.py:996] (0/4) Epoch 10, batch 7500, loss[loss=0.2233, simple_loss=0.3143, pruned_loss=0.06617, over 21213.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.2839, pruned_loss=0.06756, over 4262937.94 frames. 
], batch size: 143, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 01:00:07,138 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1691772.0, ans=0.0 2023-06-27 01:00:17,382 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1691772.0, ans=0.125 2023-06-27 01:01:07,891 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.94 vs. limit=15.0 2023-06-27 01:01:31,419 INFO [train.py:996] (0/4) Epoch 10, batch 7550, loss[loss=0.191, simple_loss=0.2426, pruned_loss=0.06973, over 20334.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2917, pruned_loss=0.06751, over 4263862.87 frames. ], batch size: 703, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 01:01:36,833 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1692012.0, ans=0.0 2023-06-27 01:01:39,830 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.177e+02 6.369e+02 9.874e+02 1.839e+03 3.635e+03, threshold=1.975e+03, percent-clipped=22.0 2023-06-27 01:01:45,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1692012.0, ans=0.2 2023-06-27 01:02:52,163 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1692252.0, ans=0.2 2023-06-27 01:02:55,604 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1692252.0, ans=0.1 2023-06-27 01:03:06,874 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.24 vs. limit=22.5 2023-06-27 01:03:11,970 INFO [train.py:996] (0/4) Epoch 10, batch 7600, loss[loss=0.2002, simple_loss=0.2816, pruned_loss=0.05945, over 21899.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.29, pruned_loss=0.06584, over 4269885.62 frames. ], batch size: 332, lr: 2.97e-03, grad_scale: 32.0 2023-06-27 01:03:20,181 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1692312.0, ans=0.0 2023-06-27 01:03:20,800 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.42 vs. limit=15.0 2023-06-27 01:03:29,942 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1692312.0, ans=0.0 2023-06-27 01:03:30,062 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1692312.0, ans=0.2 2023-06-27 01:04:14,362 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1692432.0, ans=0.5 2023-06-27 01:05:03,927 INFO [train.py:996] (0/4) Epoch 10, batch 7650, loss[loss=0.2109, simple_loss=0.2832, pruned_loss=0.06928, over 21461.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2896, pruned_loss=0.06759, over 4282555.80 frames. 
], batch size: 131, lr: 2.97e-03, grad_scale: 32.0 2023-06-27 01:05:06,281 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1692612.0, ans=0.0 2023-06-27 01:05:12,464 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.938e+02 5.695e+02 7.737e+02 9.992e+02 2.893e+03, threshold=1.547e+03, percent-clipped=4.0 2023-06-27 01:05:40,745 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1692672.0, ans=0.2 2023-06-27 01:06:19,519 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1692792.0, ans=0.125 2023-06-27 01:06:52,713 INFO [train.py:996] (0/4) Epoch 10, batch 7700, loss[loss=0.2512, simple_loss=0.3223, pruned_loss=0.09005, over 21309.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2924, pruned_loss=0.07046, over 4284944.59 frames. ], batch size: 159, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:07:02,143 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1692912.0, ans=0.0 2023-06-27 01:07:29,014 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1692972.0, ans=0.0 2023-06-27 01:07:34,785 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.74 vs. limit=15.0 2023-06-27 01:07:39,866 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1693032.0, ans=0.0 2023-06-27 01:08:43,831 INFO [train.py:996] (0/4) Epoch 10, batch 7750, loss[loss=0.2567, simple_loss=0.3659, pruned_loss=0.07375, over 21782.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2988, pruned_loss=0.07109, over 4285142.64 frames. ], batch size: 332, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:09:05,029 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.135e+02 8.248e+02 1.279e+03 1.795e+03 4.947e+03, threshold=2.557e+03, percent-clipped=28.0 2023-06-27 01:10:07,395 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.83 vs. limit=6.0 2023-06-27 01:10:10,215 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1693392.0, ans=0.125 2023-06-27 01:10:17,701 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.74 vs. limit=15.0 2023-06-27 01:10:19,157 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1693452.0, ans=0.125 2023-06-27 01:10:42,230 INFO [train.py:996] (0/4) Epoch 10, batch 7800, loss[loss=0.1805, simple_loss=0.2358, pruned_loss=0.06258, over 20802.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.3029, pruned_loss=0.07161, over 4281227.73 frames. 
], batch size: 609, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:10:52,971 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1693512.0, ans=0.125 2023-06-27 01:11:24,878 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1693632.0, ans=0.125 2023-06-27 01:11:30,195 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1693632.0, ans=0.035 2023-06-27 01:11:41,931 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.32 vs. limit=15.0 2023-06-27 01:12:08,497 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.27 vs. limit=15.0 2023-06-27 01:12:12,614 INFO [train.py:996] (0/4) Epoch 10, batch 7850, loss[loss=0.2211, simple_loss=0.2885, pruned_loss=0.07687, over 21972.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2962, pruned_loss=0.07061, over 4264812.43 frames. ], batch size: 113, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:12:32,520 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.059e+02 5.917e+02 8.514e+02 1.468e+03 3.815e+03, threshold=1.703e+03, percent-clipped=5.0 2023-06-27 01:12:38,666 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1693872.0, ans=0.1 2023-06-27 01:12:47,535 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1693872.0, ans=0.125 2023-06-27 01:13:59,659 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1694052.0, ans=0.1 2023-06-27 01:14:08,064 INFO [train.py:996] (0/4) Epoch 10, batch 7900, loss[loss=0.2757, simple_loss=0.3671, pruned_loss=0.0921, over 21478.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2913, pruned_loss=0.07004, over 4267926.46 frames. ], batch size: 471, lr: 2.97e-03, grad_scale: 8.0 2023-06-27 01:14:29,917 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.89 vs. limit=15.0 2023-06-27 01:14:50,948 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1694232.0, ans=0.2 2023-06-27 01:15:56,688 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1694352.0, ans=0.04949747468305833 2023-06-27 01:16:00,496 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1694352.0, ans=0.2 2023-06-27 01:16:02,336 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1694352.0, ans=0.125 2023-06-27 01:16:03,729 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1694412.0, ans=0.125 2023-06-27 01:16:04,757 INFO [train.py:996] (0/4) Epoch 10, batch 7950, loss[loss=0.2144, simple_loss=0.2791, pruned_loss=0.07483, over 20121.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2938, pruned_loss=0.06953, over 4260745.61 frames. 
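Note on the scaling.py:182 entries: each one reports the current value (ans) of a named ScheduledFloat hyper-parameter (dropout probabilities, skip rates, balancer limits) at the current batch_count. As a stand-in for that mechanism, a minimal piecewise-linear schedule keyed on batch count; the breakpoints below are invented for illustration, and the real ScheduledFloat in icefall's scaling.py has more behaviour:

    import bisect

    class PiecewiseLinearSchedule:
        """Map a batch count to a float by linear interpolation between breakpoints."""

        def __init__(self, *points):
            # points: (batch_count, value) pairs, assumed sorted by batch_count.
            self.xs = [float(p[0]) for p in points]
            self.ys = [float(p[1]) for p in points]

        def __call__(self, batch_count: float) -> float:
            if batch_count <= self.xs[0]:
                return self.ys[0]
            if batch_count >= self.xs[-1]:
                return self.ys[-1]
            i = bisect.bisect_right(self.xs, batch_count)
            x0, x1 = self.xs[i - 1], self.xs[i]
            y0, y1 = self.ys[i - 1], self.ys[i]
            return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

    # E.g. a dropout probability decaying from 0.3 to a floor of 0.1 early in training,
    # which would read ans=0.1 at the batch_counts seen in this part of the log:
    dropout_p = PiecewiseLinearSchedule((0, 0.3), (20000, 0.1))
    print(dropout_p(1_693_512))   # 0.1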
], batch size: 705, lr: 2.97e-03, grad_scale: 8.0 2023-06-27 01:16:10,797 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1694412.0, ans=0.0 2023-06-27 01:16:16,927 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.966e+02 5.576e+02 7.742e+02 1.234e+03 3.670e+03, threshold=1.548e+03, percent-clipped=16.0 2023-06-27 01:16:19,534 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1694412.0, ans=0.125 2023-06-27 01:16:50,267 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1694532.0, ans=0.1 2023-06-27 01:17:22,877 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1694592.0, ans=0.2 2023-06-27 01:17:56,263 INFO [train.py:996] (0/4) Epoch 10, batch 8000, loss[loss=0.2255, simple_loss=0.3089, pruned_loss=0.07104, over 17181.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.2985, pruned_loss=0.07159, over 4258220.95 frames. ], batch size: 60, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:19:14,218 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1694892.0, ans=0.125 2023-06-27 01:20:02,341 INFO [train.py:996] (0/4) Epoch 10, batch 8050, loss[loss=0.3136, simple_loss=0.3948, pruned_loss=0.1162, over 21455.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.3048, pruned_loss=0.07219, over 4261469.48 frames. ], batch size: 507, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:20:14,620 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.526e+02 7.082e+02 1.044e+03 1.392e+03 2.627e+03, threshold=2.088e+03, percent-clipped=20.0 2023-06-27 01:20:36,736 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.16 vs. limit=15.0 2023-06-27 01:21:07,766 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1695192.0, ans=0.125 2023-06-27 01:21:11,532 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1695192.0, ans=0.0 2023-06-27 01:21:27,273 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1695252.0, ans=0.125 2023-06-27 01:21:32,504 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1695252.0, ans=0.125 2023-06-27 01:21:37,781 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1695252.0, ans=0.0 2023-06-27 01:21:51,396 INFO [train.py:996] (0/4) Epoch 10, batch 8100, loss[loss=0.1906, simple_loss=0.2676, pruned_loss=0.05675, over 21859.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.302, pruned_loss=0.0721, over 4263537.09 frames. ], batch size: 282, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:22:02,534 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.34 vs. limit=15.0 2023-06-27 01:22:46,448 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.80 vs. 
limit=15.0 2023-06-27 01:23:17,229 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1695492.0, ans=10.0 2023-06-27 01:23:50,269 INFO [train.py:996] (0/4) Epoch 10, batch 8150, loss[loss=0.2285, simple_loss=0.3364, pruned_loss=0.06029, over 21570.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.3091, pruned_loss=0.07275, over 4262425.65 frames. ], batch size: 389, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:23:54,830 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=1695612.0, ans=0.05 2023-06-27 01:24:07,919 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.023e+02 5.816e+02 8.551e+02 1.587e+03 5.169e+03, threshold=1.710e+03, percent-clipped=17.0 2023-06-27 01:24:44,582 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1695732.0, ans=0.125 2023-06-27 01:25:16,905 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.29 vs. limit=12.0 2023-06-27 01:25:33,620 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1695852.0, ans=0.1 2023-06-27 01:25:38,336 INFO [train.py:996] (0/4) Epoch 10, batch 8200, loss[loss=0.1742, simple_loss=0.2299, pruned_loss=0.05927, over 20738.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.3013, pruned_loss=0.07047, over 4263870.97 frames. ], batch size: 609, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:26:54,278 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1696092.0, ans=0.125 2023-06-27 01:27:22,774 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1696152.0, ans=0.125 2023-06-27 01:27:22,818 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1696152.0, ans=0.2 2023-06-27 01:27:32,704 INFO [train.py:996] (0/4) Epoch 10, batch 8250, loss[loss=0.2837, simple_loss=0.3701, pruned_loss=0.09858, over 21502.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.3, pruned_loss=0.07092, over 4261681.16 frames. ], batch size: 471, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:27:44,591 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.725e+02 5.485e+02 7.641e+02 1.335e+03 2.771e+03, threshold=1.528e+03, percent-clipped=11.0 2023-06-27 01:28:04,417 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1696272.0, ans=0.2 2023-06-27 01:28:51,460 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1696392.0, ans=0.125 2023-06-27 01:29:21,569 INFO [train.py:996] (0/4) Epoch 10, batch 8300, loss[loss=0.2462, simple_loss=0.3314, pruned_loss=0.08053, over 21576.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2974, pruned_loss=0.06807, over 4265737.99 frames. 
], batch size: 441, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:30:20,331 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1696632.0, ans=0.125 2023-06-27 01:30:24,197 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1696692.0, ans=0.0 2023-06-27 01:30:29,853 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.23 vs. limit=15.0 2023-06-27 01:30:43,295 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1696752.0, ans=0.1 2023-06-27 01:31:11,522 INFO [train.py:996] (0/4) Epoch 10, batch 8350, loss[loss=0.2006, simple_loss=0.2861, pruned_loss=0.05759, over 21638.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2969, pruned_loss=0.06607, over 4264773.58 frames. ], batch size: 263, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:31:22,491 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1696812.0, ans=0.125 2023-06-27 01:31:23,487 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.611e+02 5.774e+02 7.528e+02 1.140e+03 3.100e+03, threshold=1.506e+03, percent-clipped=11.0 2023-06-27 01:31:47,535 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1696872.0, ans=10.0 2023-06-27 01:32:39,319 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1697052.0, ans=0.0 2023-06-27 01:32:43,295 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1697052.0, ans=0.1 2023-06-27 01:33:01,163 INFO [train.py:996] (0/4) Epoch 10, batch 8400, loss[loss=0.1839, simple_loss=0.27, pruned_loss=0.04891, over 21405.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2939, pruned_loss=0.06429, over 4258935.12 frames. ], batch size: 194, lr: 2.97e-03, grad_scale: 32.0 2023-06-27 01:33:19,350 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1697172.0, ans=0.1 2023-06-27 01:33:30,208 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.60 vs. limit=15.0 2023-06-27 01:33:37,864 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1697172.0, ans=0.0 2023-06-27 01:33:37,946 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1697172.0, ans=0.125 2023-06-27 01:33:39,994 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1697232.0, ans=0.0 2023-06-27 01:34:23,748 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1697292.0, ans=0.125 2023-06-27 01:34:48,827 INFO [train.py:996] (0/4) Epoch 10, batch 8450, loss[loss=0.2144, simple_loss=0.2863, pruned_loss=0.07129, over 21855.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2922, pruned_loss=0.06307, over 4258646.93 frames. 
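Note on the tot_loss fields: each train.py:996 entry pairs the current batch's loss "over N frames" with a tot_loss reported over a few million frames, which behaves roughly like a frame-weighted running aggregate of recent batches. A small sketch of that bookkeeping; the class name and the plain cumulative accumulation (rather than any decayed variant) are illustrative:

    class FrameWeightedAverage:
        """Accumulate per-batch losses weighted by the number of frames they cover."""

        def __init__(self) -> None:
            self.loss_sum = 0.0
            self.frames = 0.0

        def update(self, batch_loss: float, batch_frames: float) -> None:
            self.loss_sum += batch_loss * batch_frames
            self.frames += batch_frames

        @property
        def average(self) -> float:
            return self.loss_sum / max(self.frames, 1.0)

    tot = FrameWeightedAverage()
    tot.update(0.2452, 21859.0)   # values shaped like one batch in this log
    tot.update(0.2106, 21650.0)
    print(f"tot_loss={tot.average:.4f}, over {tot.frames:.2f} frames")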
], batch size: 351, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:35:02,446 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.213e+02 7.215e+02 1.072e+03 1.642e+03 3.949e+03, threshold=2.143e+03, percent-clipped=30.0 2023-06-27 01:35:02,952 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1697412.0, ans=0.0 2023-06-27 01:35:08,371 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.41 vs. limit=15.0 2023-06-27 01:35:41,056 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1697532.0, ans=0.1 2023-06-27 01:35:44,437 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1697532.0, ans=0.5 2023-06-27 01:36:10,710 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1697592.0, ans=0.125 2023-06-27 01:36:37,997 INFO [train.py:996] (0/4) Epoch 10, batch 8500, loss[loss=0.2378, simple_loss=0.2994, pruned_loss=0.08809, over 14995.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2893, pruned_loss=0.06411, over 4255202.55 frames. ], batch size: 60, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:37:11,856 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.82 vs. limit=22.5 2023-06-27 01:37:28,239 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.79 vs. limit=15.0 2023-06-27 01:38:24,332 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.33 vs. limit=15.0 2023-06-27 01:38:28,166 INFO [train.py:996] (0/4) Epoch 10, batch 8550, loss[loss=0.2051, simple_loss=0.2893, pruned_loss=0.06043, over 21364.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2932, pruned_loss=0.06685, over 4266699.00 frames. ], batch size: 176, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:38:37,576 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1698012.0, ans=0.125 2023-06-27 01:38:41,907 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.182e+02 6.171e+02 1.011e+03 1.607e+03 3.555e+03, threshold=2.023e+03, percent-clipped=12.0 2023-06-27 01:39:17,131 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.89 vs. limit=6.0 2023-06-27 01:39:55,041 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1698192.0, ans=0.0 2023-06-27 01:40:14,894 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.whiten.whitening_limit, batch_count=1698252.0, ans=12.0 2023-06-27 01:40:17,115 INFO [train.py:996] (0/4) Epoch 10, batch 8600, loss[loss=0.2467, simple_loss=0.3251, pruned_loss=0.08412, over 21618.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2977, pruned_loss=0.06817, over 4272387.54 frames. 
], batch size: 389, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:41:13,604 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.89 vs. limit=15.0 2023-06-27 01:41:51,536 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.78 vs. limit=22.5 2023-06-27 01:42:05,297 INFO [train.py:996] (0/4) Epoch 10, batch 8650, loss[loss=0.2106, simple_loss=0.3163, pruned_loss=0.0525, over 21650.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.3034, pruned_loss=0.06952, over 4271764.85 frames. ], batch size: 414, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:42:24,799 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.451e+02 5.765e+02 7.630e+02 1.183e+03 2.009e+03, threshold=1.526e+03, percent-clipped=0.0 2023-06-27 01:43:16,961 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.80 vs. limit=15.0 2023-06-27 01:43:34,724 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.68 vs. limit=12.0 2023-06-27 01:43:47,536 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 01:43:50,336 INFO [train.py:996] (0/4) Epoch 10, batch 8700, loss[loss=0.1788, simple_loss=0.2422, pruned_loss=0.05772, over 21664.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2951, pruned_loss=0.06556, over 4268495.86 frames. ], batch size: 282, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:43:54,085 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1698912.0, ans=0.0 2023-06-27 01:44:44,414 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1699032.0, ans=0.0 2023-06-27 01:44:44,489 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1699032.0, ans=0.125 2023-06-27 01:44:44,566 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1699032.0, ans=0.125 2023-06-27 01:44:49,300 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1699032.0, ans=0.0 2023-06-27 01:45:08,676 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1699092.0, ans=0.0 2023-06-27 01:45:15,534 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1699092.0, ans=0.125 2023-06-27 01:45:39,154 INFO [train.py:996] (0/4) Epoch 10, batch 8750, loss[loss=0.2452, simple_loss=0.3025, pruned_loss=0.09401, over 21693.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.29, pruned_loss=0.06596, over 4263833.68 frames. 
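Note on the scaling.py:962 entries: they compare a per-module "Whitening" metric against a limit (e.g. metric=7.50 vs. limit=15.0) for a group of channels. The exact statistic is internal to icefall's scaling.py and is not reproduced in this log; purely as an illustration of a covariance-spread statistic with the same flavour, the function below returns 1.0 when the channel covariance is proportional to the identity and approaches num_channels when the activations collapse onto a single direction:

    import torch

    def whiteness_metric(x: torch.Tensor) -> float:
        """x: (num_frames, num_channels) activations; returns a value in [1, num_channels]."""
        x = x - x.mean(dim=0, keepdim=True)        # zero-mean each channel
        cov = x.t() @ x / x.shape[0]               # channel covariance
        mean_diag = cov.diagonal().mean()
        # 1.0 for cov proportional to I; num_channels if all variance is in one direction.
        return float((cov ** 2).sum() / (cov.shape[0] * mean_diag ** 2))

    x = torch.randn(1000, 256)                     # roughly white activations
    print(whiteness_metric(x))                     # near 1, well under a limit like 15.0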
], batch size: 473, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:45:59,224 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.645e+02 6.087e+02 8.152e+02 1.140e+03 2.309e+03, threshold=1.630e+03, percent-clipped=11.0 2023-06-27 01:46:23,488 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1699272.0, ans=0.125 2023-06-27 01:46:29,229 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.13 vs. limit=12.0 2023-06-27 01:46:41,558 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.48 vs. limit=12.0 2023-06-27 01:47:34,999 INFO [train.py:996] (0/4) Epoch 10, batch 8800, loss[loss=0.2429, simple_loss=0.3251, pruned_loss=0.08033, over 21719.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2988, pruned_loss=0.06904, over 4273984.83 frames. ], batch size: 298, lr: 2.97e-03, grad_scale: 32.0 2023-06-27 01:47:50,410 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1699512.0, ans=0.125 2023-06-27 01:48:40,240 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1699632.0, ans=0.1 2023-06-27 01:49:33,051 INFO [train.py:996] (0/4) Epoch 10, batch 8850, loss[loss=0.2326, simple_loss=0.3191, pruned_loss=0.07304, over 21285.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.3045, pruned_loss=0.07089, over 4275896.47 frames. ], batch size: 159, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:49:47,074 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1699812.0, ans=0.0 2023-06-27 01:49:48,559 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.063e+02 5.642e+02 7.591e+02 1.245e+03 2.739e+03, threshold=1.518e+03, percent-clipped=14.0 2023-06-27 01:49:50,627 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1699872.0, ans=0.5 2023-06-27 01:51:22,894 INFO [train.py:996] (0/4) Epoch 10, batch 8900, loss[loss=0.2166, simple_loss=0.2853, pruned_loss=0.07394, over 21569.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2994, pruned_loss=0.07008, over 4275284.12 frames. ], batch size: 414, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:51:29,840 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.96 vs. limit=22.5 2023-06-27 01:52:23,791 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.38 vs. limit=15.0 2023-06-27 01:52:43,425 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1700292.0, ans=0.1 2023-06-27 01:53:00,085 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1700352.0, ans=0.125 2023-06-27 01:53:08,344 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1700352.0, ans=0.0 2023-06-27 01:53:21,316 INFO [train.py:996] (0/4) Epoch 10, batch 8950, loss[loss=0.226, simple_loss=0.3065, pruned_loss=0.07277, over 21702.00 frames. 
], tot_loss[loss=0.2222, simple_loss=0.3042, pruned_loss=0.07005, over 4265608.66 frames. ], batch size: 298, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:53:27,397 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1700412.0, ans=0.2 2023-06-27 01:53:42,456 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.709e+02 6.064e+02 9.607e+02 1.976e+03 3.801e+03, threshold=1.921e+03, percent-clipped=34.0 2023-06-27 01:54:41,472 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1700592.0, ans=0.0 2023-06-27 01:55:06,971 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1700652.0, ans=0.125 2023-06-27 01:55:09,681 INFO [train.py:996] (0/4) Epoch 10, batch 9000, loss[loss=0.1706, simple_loss=0.2431, pruned_loss=0.04907, over 21870.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2974, pruned_loss=0.06917, over 4265099.32 frames. ], batch size: 118, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:55:09,683 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-27 01:55:25,197 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.4.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([2.2468, 3.7891, 3.4181, 2.3417], device='cuda:0') 2023-06-27 01:55:27,999 INFO [train.py:1028] (0/4) Epoch 10, validation: loss=0.2678, simple_loss=0.3533, pruned_loss=0.09113, over 1796401.00 frames. 2023-06-27 01:55:27,999 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23714MB 2023-06-27 01:56:14,466 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.77 vs. limit=6.0 2023-06-27 01:56:48,562 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1700892.0, ans=0.1 2023-06-27 01:56:50,903 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.40 vs. limit=15.0 2023-06-27 01:57:23,159 INFO [train.py:996] (0/4) Epoch 10, batch 9050, loss[loss=0.2068, simple_loss=0.2914, pruned_loss=0.06105, over 21688.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2899, pruned_loss=0.06594, over 4256858.76 frames. ], batch size: 351, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:57:26,134 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1701012.0, ans=0.0 2023-06-27 01:57:45,756 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.665e+02 7.496e+02 1.289e+03 1.830e+03 3.310e+03, threshold=2.578e+03, percent-clipped=22.0 2023-06-27 01:58:24,424 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.55 vs. limit=15.0 2023-06-27 01:58:30,672 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1701132.0, ans=0.125 2023-06-27 01:58:38,480 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=6.13 vs. limit=22.5 2023-06-27 01:58:40,437 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.40 vs. 
limit=12.0 2023-06-27 01:59:13,807 INFO [train.py:996] (0/4) Epoch 10, batch 9100, loss[loss=0.3031, simple_loss=0.3619, pruned_loss=0.1222, over 21329.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2955, pruned_loss=0.06892, over 4262863.50 frames. ], batch size: 507, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 02:00:23,986 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1701492.0, ans=0.07 2023-06-27 02:00:36,201 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1701492.0, ans=0.0 2023-06-27 02:01:09,277 INFO [train.py:996] (0/4) Epoch 10, batch 9150, loss[loss=0.2176, simple_loss=0.2994, pruned_loss=0.06792, over 21414.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2982, pruned_loss=0.06618, over 4264999.36 frames. ], batch size: 160, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 02:01:16,844 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1701612.0, ans=0.125 2023-06-27 02:01:18,779 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1701612.0, ans=0.125 2023-06-27 02:01:24,806 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.482e+02 5.209e+02 7.364e+02 1.147e+03 3.350e+03, threshold=1.473e+03, percent-clipped=3.0 2023-06-27 02:02:34,495 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1701852.0, ans=0.0 2023-06-27 02:02:48,190 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.07 vs. limit=22.5 2023-06-27 02:02:58,997 INFO [train.py:996] (0/4) Epoch 10, batch 9200, loss[loss=0.2363, simple_loss=0.3237, pruned_loss=0.07441, over 21287.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.3007, pruned_loss=0.06581, over 4274200.99 frames. ], batch size: 548, lr: 2.97e-03, grad_scale: 32.0 2023-06-27 02:03:35,742 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=1701972.0, ans=10.0 2023-06-27 02:03:36,431 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.82 vs. limit=6.0 2023-06-27 02:04:04,805 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1702092.0, ans=0.2 2023-06-27 02:04:07,111 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.82 vs. limit=15.0 2023-06-27 02:04:18,288 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 02:04:33,831 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1702152.0, ans=0.125 2023-06-27 02:04:45,314 INFO [train.py:996] (0/4) Epoch 10, batch 9250, loss[loss=0.2125, simple_loss=0.279, pruned_loss=0.07296, over 21561.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.3048, pruned_loss=0.06883, over 4272146.96 frames. 
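Note on the validation block above (train.py:1019/1028/1029, at batch 9000): the model is periodically evaluated on the dev loader, and the frame-weighted validation loss plus the peak GPU memory are printed. A hedged sketch of that pattern, assuming a compute_loss(model, batch) helper that returns (loss, num_frames); the helper and loader names are placeholders, not icefall's actual API:

    import torch

    def compute_validation_loss(model, valid_dl, device, compute_loss):
        model.eval()
        loss_sum, frames = 0.0, 0.0
        with torch.no_grad():
            for batch in valid_dl:
                loss, num_frames = compute_loss(model, batch)   # placeholder helper
                loss_sum += float(loss) * num_frames
                frames += num_frames
        model.train()
        max_mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
        print(f"validation: loss={loss_sum / frames:.4f}, over {frames:.2f} frames")
        print(f"Maximum memory allocated so far is {max_mb}MB")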
], batch size: 391, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 02:05:02,710 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.922e+02 6.299e+02 8.423e+02 1.393e+03 3.022e+03, threshold=1.685e+03, percent-clipped=24.0 2023-06-27 02:05:28,563 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1702272.0, ans=0.2 2023-06-27 02:06:35,096 INFO [train.py:996] (0/4) Epoch 10, batch 9300, loss[loss=0.2297, simple_loss=0.3193, pruned_loss=0.07012, over 21416.00 frames. ], tot_loss[loss=0.218, simple_loss=0.298, pruned_loss=0.06897, over 4265486.44 frames. ], batch size: 211, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 02:06:44,759 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1702512.0, ans=0.2 2023-06-27 02:06:55,429 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1702512.0, ans=0.1 2023-06-27 02:06:55,506 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff3.min_abs, batch_count=1702512.0, ans=0.2 2023-06-27 02:07:05,500 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1702572.0, ans=0.125 2023-06-27 02:07:53,178 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1702692.0, ans=0.1 2023-06-27 02:08:18,195 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1702812.0, ans=0.1 2023-06-27 02:08:19,216 INFO [train.py:996] (0/4) Epoch 10, batch 9350, loss[loss=0.2452, simple_loss=0.3267, pruned_loss=0.08187, over 21861.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.3051, pruned_loss=0.06981, over 4264758.70 frames. ], batch size: 371, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 02:08:29,235 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1702812.0, ans=0.125 2023-06-27 02:08:47,201 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.895e+02 6.669e+02 9.528e+02 1.719e+03 4.361e+03, threshold=1.906e+03, percent-clipped=26.0 2023-06-27 02:09:32,655 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1702992.0, ans=0.125 2023-06-27 02:09:34,195 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1702992.0, ans=0.1 2023-06-27 02:10:15,140 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.11 vs. limit=22.5 2023-06-27 02:10:18,894 INFO [train.py:996] (0/4) Epoch 10, batch 9400, loss[loss=0.2082, simple_loss=0.2749, pruned_loss=0.07071, over 21165.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.3055, pruned_loss=0.07015, over 4263073.86 frames. 
], batch size: 143, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 02:10:25,001 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=1703112.0, ans=0.5 2023-06-27 02:10:53,921 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1703172.0, ans=0.125 2023-06-27 02:11:08,455 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.40 vs. limit=22.5 2023-06-27 02:12:05,128 INFO [train.py:996] (0/4) Epoch 10, batch 9450, loss[loss=0.2389, simple_loss=0.3543, pruned_loss=0.06177, over 20852.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2967, pruned_loss=0.06879, over 4258152.16 frames. ], batch size: 608, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 02:12:22,342 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.110e+02 5.502e+02 7.576e+02 1.129e+03 2.324e+03, threshold=1.515e+03, percent-clipped=5.0 2023-06-27 02:13:39,141 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1703652.0, ans=0.125 2023-06-27 02:13:52,555 INFO [train.py:996] (0/4) Epoch 10, batch 9500, loss[loss=0.1941, simple_loss=0.2801, pruned_loss=0.05409, over 21616.00 frames. ], tot_loss[loss=0.212, simple_loss=0.2899, pruned_loss=0.06707, over 4251565.68 frames. ], batch size: 441, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 02:14:51,918 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1703832.0, ans=0.0 2023-06-27 02:15:35,979 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-284000.pt 2023-06-27 02:15:42,681 INFO [train.py:996] (0/4) Epoch 10, batch 9550, loss[loss=0.2557, simple_loss=0.3311, pruned_loss=0.09013, over 21432.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2936, pruned_loss=0.06901, over 4261948.35 frames. ], batch size: 194, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 02:15:48,298 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1704012.0, ans=0.125 2023-06-27 02:15:52,424 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.05 vs. limit=15.0 2023-06-27 02:16:04,336 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.95 vs. limit=12.0 2023-06-27 02:16:04,753 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.287e+02 6.617e+02 9.297e+02 1.429e+03 3.226e+03, threshold=1.859e+03, percent-clipped=22.0 2023-06-27 02:16:13,151 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.55 vs. limit=15.0 2023-06-27 02:17:29,884 INFO [train.py:996] (0/4) Epoch 10, batch 9600, loss[loss=0.2034, simple_loss=0.2816, pruned_loss=0.06262, over 21770.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2961, pruned_loss=0.07061, over 4266818.05 frames. 
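Note on the checkpoint.py:75 line above: alongside the per-epoch checkpoints, a batch-count-indexed checkpoint (checkpoint-284000.pt) is written under the experiment directory. A minimal sketch of that kind of periodic save; the save_every_n value and the contents of the dict are illustrative, and a real checkpoint would also carry scheduler, sampler and grad-scaler state:

    import torch

    def maybe_save_checkpoint(model, optimizer, batch_idx_train, exp_dir,
                              save_every_n: int = 4000) -> None:
        if batch_idx_train == 0 or batch_idx_train % save_every_n != 0:
            return
        filename = f"{exp_dir}/checkpoint-{batch_idx_train}.pt"
        torch.save(
            {
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "batch_idx_train": batch_idx_train,
            },
            filename,
        )
        print(f"Saving checkpoint to {filename}")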
], batch size: 112, lr: 2.96e-03, grad_scale: 32.0 2023-06-27 02:18:42,276 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1704492.0, ans=0.125 2023-06-27 02:18:46,117 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1704492.0, ans=0.2 2023-06-27 02:19:13,793 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1704552.0, ans=0.0 2023-06-27 02:19:26,584 INFO [train.py:996] (0/4) Epoch 10, batch 9650, loss[loss=0.2173, simple_loss=0.2948, pruned_loss=0.06993, over 21823.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2955, pruned_loss=0.07032, over 4272876.72 frames. ], batch size: 282, lr: 2.96e-03, grad_scale: 32.0 2023-06-27 02:19:36,402 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1704612.0, ans=0.1 2023-06-27 02:19:45,812 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.795e+02 6.257e+02 8.564e+02 1.301e+03 2.812e+03, threshold=1.713e+03, percent-clipped=7.0 2023-06-27 02:20:27,518 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1704792.0, ans=0.125 2023-06-27 02:21:02,372 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1704852.0, ans=0.1 2023-06-27 02:21:15,596 INFO [train.py:996] (0/4) Epoch 10, batch 9700, loss[loss=0.2106, simple_loss=0.3016, pruned_loss=0.05982, over 21810.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2997, pruned_loss=0.07078, over 4270452.94 frames. ], batch size: 332, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 02:23:03,786 INFO [train.py:996] (0/4) Epoch 10, batch 9750, loss[loss=0.1921, simple_loss=0.2604, pruned_loss=0.06191, over 21386.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2943, pruned_loss=0.06914, over 4273044.60 frames. ], batch size: 194, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 02:23:27,954 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.191e+02 6.700e+02 1.068e+03 1.546e+03 3.673e+03, threshold=2.135e+03, percent-clipped=19.0 2023-06-27 02:23:30,310 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1705272.0, ans=0.125 2023-06-27 02:24:43,266 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=13.80 vs. limit=15.0 2023-06-27 02:24:45,100 INFO [train.py:996] (0/4) Epoch 10, batch 9800, loss[loss=0.2004, simple_loss=0.2652, pruned_loss=0.06781, over 21746.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2926, pruned_loss=0.06923, over 4267453.57 frames. ], batch size: 351, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 02:25:19,116 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.51 vs. limit=22.5 2023-06-27 02:25:41,476 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.63 vs. 
limit=15.0 2023-06-27 02:26:09,583 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1705692.0, ans=0.125 2023-06-27 02:26:19,339 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1705752.0, ans=0.125 2023-06-27 02:26:21,292 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1705752.0, ans=0.0 2023-06-27 02:26:38,263 INFO [train.py:996] (0/4) Epoch 10, batch 9850, loss[loss=0.1849, simple_loss=0.2524, pruned_loss=0.05865, over 21777.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2907, pruned_loss=0.06928, over 4266463.90 frames. ], batch size: 298, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 02:27:02,352 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.847e+02 5.295e+02 7.367e+02 1.134e+03 2.701e+03, threshold=1.473e+03, percent-clipped=3.0 2023-06-27 02:27:03,063 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1705872.0, ans=0.2 2023-06-27 02:27:59,586 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1705992.0, ans=0.125 2023-06-27 02:27:59,708 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1705992.0, ans=0.125 2023-06-27 02:28:22,393 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.88 vs. limit=22.5 2023-06-27 02:28:26,485 INFO [train.py:996] (0/4) Epoch 10, batch 9900, loss[loss=0.1914, simple_loss=0.2464, pruned_loss=0.06821, over 21230.00 frames. ], tot_loss[loss=0.212, simple_loss=0.287, pruned_loss=0.06855, over 4261284.45 frames. ], batch size: 548, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 02:28:55,405 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1706172.0, ans=0.2 2023-06-27 02:29:25,281 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.96 vs. limit=15.0 2023-06-27 02:30:14,302 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1706412.0, ans=0.125 2023-06-27 02:30:15,218 INFO [train.py:996] (0/4) Epoch 10, batch 9950, loss[loss=0.247, simple_loss=0.2928, pruned_loss=0.1006, over 21409.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.29, pruned_loss=0.07122, over 4260989.09 frames. 
], batch size: 509, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 02:30:26,604 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1706412.0, ans=0.1 2023-06-27 02:30:39,767 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.384e+02 6.546e+02 9.078e+02 1.320e+03 2.583e+03, threshold=1.816e+03, percent-clipped=18.0 2023-06-27 02:30:50,556 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1706472.0, ans=0.1 2023-06-27 02:31:06,853 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1706532.0, ans=0.2 2023-06-27 02:31:59,298 INFO [train.py:996] (0/4) Epoch 10, batch 10000, loss[loss=0.1884, simple_loss=0.2547, pruned_loss=0.06103, over 21265.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2861, pruned_loss=0.07053, over 4260949.93 frames. ], batch size: 548, lr: 2.96e-03, grad_scale: 32.0 2023-06-27 02:32:05,416 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1706712.0, ans=0.0 2023-06-27 02:33:01,019 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1706832.0, ans=0.125 2023-06-27 02:33:28,860 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1706892.0, ans=0.1 2023-06-27 02:33:35,863 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 02:33:44,590 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1706952.0, ans=0.1 2023-06-27 02:33:44,595 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1706952.0, ans=0.025 2023-06-27 02:33:57,213 INFO [train.py:996] (0/4) Epoch 10, batch 10050, loss[loss=0.1762, simple_loss=0.2605, pruned_loss=0.04597, over 21783.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2865, pruned_loss=0.06968, over 4267915.21 frames. ], batch size: 282, lr: 2.96e-03, grad_scale: 32.0 2023-06-27 02:34:01,322 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1707012.0, ans=0.0 2023-06-27 02:34:16,302 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.082e+02 5.853e+02 8.209e+02 1.305e+03 2.955e+03, threshold=1.642e+03, percent-clipped=12.0 2023-06-27 02:34:18,766 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1707072.0, ans=0.125 2023-06-27 02:34:52,074 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1707132.0, ans=0.1 2023-06-27 02:35:45,585 INFO [train.py:996] (0/4) Epoch 10, batch 10100, loss[loss=0.2129, simple_loss=0.2992, pruned_loss=0.06334, over 21740.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2846, pruned_loss=0.0681, over 4268020.37 frames. 
], batch size: 332, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 02:36:02,123 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1707372.0, ans=0.2 2023-06-27 02:36:51,614 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1707432.0, ans=0.95 2023-06-27 02:36:58,698 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1707492.0, ans=0.1 2023-06-27 02:37:00,201 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1707492.0, ans=0.1 2023-06-27 02:37:33,974 INFO [train.py:996] (0/4) Epoch 10, batch 10150, loss[loss=0.1906, simple_loss=0.2603, pruned_loss=0.06042, over 21855.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2891, pruned_loss=0.06967, over 4263926.70 frames. ], batch size: 102, lr: 2.96e-03, grad_scale: 8.0 2023-06-27 02:37:47,086 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.83 vs. limit=15.0 2023-06-27 02:38:02,106 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.860e+02 5.691e+02 7.969e+02 1.243e+03 2.132e+03, threshold=1.594e+03, percent-clipped=9.0 2023-06-27 02:38:25,947 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.28 vs. limit=15.0 2023-06-27 02:38:53,258 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1707792.0, ans=0.0 2023-06-27 02:38:57,088 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1707792.0, ans=0.0 2023-06-27 02:39:22,073 INFO [train.py:996] (0/4) Epoch 10, batch 10200, loss[loss=0.1714, simple_loss=0.2315, pruned_loss=0.05564, over 20803.00 frames. ], tot_loss[loss=0.212, simple_loss=0.2878, pruned_loss=0.06804, over 4269984.24 frames. ], batch size: 608, lr: 2.96e-03, grad_scale: 8.0 2023-06-27 02:40:07,227 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1708032.0, ans=0.025 2023-06-27 02:40:45,243 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.69 vs. limit=22.5 2023-06-27 02:41:05,809 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1708152.0, ans=0.125 2023-06-27 02:41:10,253 INFO [train.py:996] (0/4) Epoch 10, batch 10250, loss[loss=0.1685, simple_loss=0.2622, pruned_loss=0.03741, over 21570.00 frames. ], tot_loss[loss=0.2044, simple_loss=0.2832, pruned_loss=0.06281, over 4268661.73 frames. 
], batch size: 230, lr: 2.96e-03, grad_scale: 8.0 2023-06-27 02:41:32,374 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1708272.0, ans=0.125 2023-06-27 02:41:44,125 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.003e+02 5.121e+02 6.832e+02 1.019e+03 2.987e+03, threshold=1.366e+03, percent-clipped=4.0 2023-06-27 02:42:14,855 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1708332.0, ans=0.0 2023-06-27 02:43:03,418 INFO [train.py:996] (0/4) Epoch 10, batch 10300, loss[loss=0.2168, simple_loss=0.3115, pruned_loss=0.06103, over 21817.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.2854, pruned_loss=0.06323, over 4275134.00 frames. ], batch size: 282, lr: 2.96e-03, grad_scale: 8.0 2023-06-27 02:43:37,075 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1708572.0, ans=0.1 2023-06-27 02:43:37,677 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.68 vs. limit=22.5 2023-06-27 02:44:21,936 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1708692.0, ans=0.125 2023-06-27 02:44:37,806 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1708752.0, ans=0.1 2023-06-27 02:45:06,403 INFO [train.py:996] (0/4) Epoch 10, batch 10350, loss[loss=0.1564, simple_loss=0.2119, pruned_loss=0.0505, over 21884.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2887, pruned_loss=0.06379, over 4275701.01 frames. ], batch size: 107, lr: 2.96e-03, grad_scale: 8.0 2023-06-27 02:45:12,458 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1708812.0, ans=0.125 2023-06-27 02:45:20,541 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.28 vs. limit=10.0 2023-06-27 02:45:35,569 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.505e+02 7.876e+02 1.206e+03 1.704e+03 3.503e+03, threshold=2.411e+03, percent-clipped=40.0 2023-06-27 02:46:04,721 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1708932.0, ans=0.1 2023-06-27 02:46:28,116 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.99 vs. limit=22.5 2023-06-27 02:46:57,631 INFO [train.py:996] (0/4) Epoch 10, batch 10400, loss[loss=0.1701, simple_loss=0.2217, pruned_loss=0.05922, over 21150.00 frames. ], tot_loss[loss=0.2042, simple_loss=0.2821, pruned_loss=0.0632, over 4264331.01 frames. ], batch size: 143, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 02:47:01,073 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.95 vs. limit=10.0 2023-06-27 02:47:43,358 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1709232.0, ans=0.125 2023-06-27 02:48:01,064 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=11.22 vs. 
limit=10.0 2023-06-27 02:48:19,873 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1709292.0, ans=0.0 2023-06-27 02:48:34,229 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1709352.0, ans=0.125 2023-06-27 02:48:52,945 INFO [train.py:996] (0/4) Epoch 10, batch 10450, loss[loss=0.2433, simple_loss=0.3183, pruned_loss=0.08416, over 21791.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.2858, pruned_loss=0.06582, over 4269533.11 frames. ], batch size: 118, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 02:49:21,564 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.083e+02 7.279e+02 1.026e+03 1.542e+03 3.103e+03, threshold=2.052e+03, percent-clipped=9.0 2023-06-27 02:49:24,917 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.14 vs. limit=15.0 2023-06-27 02:50:41,344 INFO [train.py:996] (0/4) Epoch 10, batch 10500, loss[loss=0.1789, simple_loss=0.2506, pruned_loss=0.05366, over 21607.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2862, pruned_loss=0.06466, over 4265204.64 frames. ], batch size: 298, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 02:51:32,950 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.09 vs. limit=10.0 2023-06-27 02:51:34,646 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.49 vs. limit=15.0 2023-06-27 02:51:54,449 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1709892.0, ans=0.125 2023-06-27 02:52:28,659 INFO [train.py:996] (0/4) Epoch 10, batch 10550, loss[loss=0.1759, simple_loss=0.2387, pruned_loss=0.05653, over 21371.00 frames. ], tot_loss[loss=0.2054, simple_loss=0.2816, pruned_loss=0.06461, over 4261708.89 frames. ], batch size: 211, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 02:52:43,492 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.24 vs. limit=6.0 2023-06-27 02:52:55,945 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.583e+02 5.517e+02 8.817e+02 1.298e+03 2.428e+03, threshold=1.763e+03, percent-clipped=4.0 2023-06-27 02:53:14,842 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1710132.0, ans=0.0 2023-06-27 02:53:22,564 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.26 vs. limit=15.0 2023-06-27 02:53:43,535 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1710192.0, ans=0.125 2023-06-27 02:54:00,204 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1710252.0, ans=0.0 2023-06-27 02:54:16,529 INFO [train.py:996] (0/4) Epoch 10, batch 10600, loss[loss=0.1749, simple_loss=0.2663, pruned_loss=0.04177, over 21679.00 frames. ], tot_loss[loss=0.2029, simple_loss=0.2788, pruned_loss=0.06353, over 4257449.38 frames. 
], batch size: 298, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 02:54:41,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1710372.0, ans=0.2 2023-06-27 02:54:43,178 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1710372.0, ans=0.05 2023-06-27 02:55:05,291 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=13.64 vs. limit=15.0 2023-06-27 02:56:13,019 INFO [train.py:996] (0/4) Epoch 10, batch 10650, loss[loss=0.2169, simple_loss=0.2793, pruned_loss=0.07726, over 19989.00 frames. ], tot_loss[loss=0.2026, simple_loss=0.2799, pruned_loss=0.06268, over 4251585.21 frames. ], batch size: 702, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 02:56:22,669 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1710612.0, ans=0.0 2023-06-27 02:56:26,227 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1710612.0, ans=0.125 2023-06-27 02:56:35,989 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.041e+02 6.303e+02 9.847e+02 1.673e+03 3.050e+03, threshold=1.969e+03, percent-clipped=22.0 2023-06-27 02:56:37,289 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.15 vs. limit=22.5 2023-06-27 02:57:33,313 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1710792.0, ans=0.125 2023-06-27 02:58:01,508 INFO [train.py:996] (0/4) Epoch 10, batch 10700, loss[loss=0.2339, simple_loss=0.3133, pruned_loss=0.07723, over 21568.00 frames. ], tot_loss[loss=0.2013, simple_loss=0.2784, pruned_loss=0.06203, over 4257470.05 frames. ], batch size: 389, lr: 2.96e-03, grad_scale: 8.0 2023-06-27 02:58:16,379 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1710912.0, ans=0.1 2023-06-27 02:58:32,419 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1710972.0, ans=0.125 2023-06-27 02:59:45,526 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1711152.0, ans=0.125 2023-06-27 02:59:51,983 INFO [train.py:996] (0/4) Epoch 10, batch 10750, loss[loss=0.2839, simple_loss=0.3795, pruned_loss=0.09418, over 21609.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2891, pruned_loss=0.06563, over 4261159.84 frames. ], batch size: 389, lr: 2.96e-03, grad_scale: 8.0 2023-06-27 03:00:16,674 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1711272.0, ans=0.125 2023-06-27 03:00:21,244 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.422e+02 6.069e+02 8.010e+02 1.380e+03 3.013e+03, threshold=1.602e+03, percent-clipped=10.0 2023-06-27 03:01:17,275 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1711392.0, ans=0.125 2023-06-27 03:01:41,464 INFO [train.py:996] (0/4) Epoch 10, batch 10800, loss[loss=0.285, simple_loss=0.3514, pruned_loss=0.1093, over 21353.00 frames. 
], tot_loss[loss=0.2141, simple_loss=0.2949, pruned_loss=0.06665, over 4264443.27 frames. ], batch size: 507, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 03:02:13,034 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1711572.0, ans=0.125 2023-06-27 03:02:37,842 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.10 vs. limit=15.0 2023-06-27 03:03:03,225 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1711692.0, ans=0.0 2023-06-27 03:03:30,057 INFO [train.py:996] (0/4) Epoch 10, batch 10850, loss[loss=0.2054, simple_loss=0.276, pruned_loss=0.06737, over 21593.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2944, pruned_loss=0.06725, over 4264352.88 frames. ], batch size: 391, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 03:03:44,163 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1711812.0, ans=0.1 2023-06-27 03:04:00,859 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1711872.0, ans=0.125 2023-06-27 03:04:01,549 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.90 vs. limit=15.0 2023-06-27 03:04:05,456 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.119e+02 5.251e+02 7.747e+02 1.275e+03 2.663e+03, threshold=1.549e+03, percent-clipped=11.0 2023-06-27 03:04:06,255 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1711872.0, ans=0.1 2023-06-27 03:05:01,957 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1712052.0, ans=0.1 2023-06-27 03:05:23,783 INFO [train.py:996] (0/4) Epoch 10, batch 10900, loss[loss=0.2118, simple_loss=0.2911, pruned_loss=0.06623, over 21575.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2899, pruned_loss=0.06526, over 4244246.39 frames. ], batch size: 414, lr: 2.96e-03, grad_scale: 8.0 2023-06-27 03:06:16,336 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.79 vs. limit=12.0 2023-06-27 03:06:18,859 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1712232.0, ans=0.2 2023-06-27 03:06:25,794 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1712232.0, ans=0.1 2023-06-27 03:06:34,683 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1712292.0, ans=0.0 2023-06-27 03:06:50,470 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1712352.0, ans=0.0 2023-06-27 03:06:52,019 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1712352.0, ans=0.0 2023-06-27 03:07:12,334 INFO [train.py:996] (0/4) Epoch 10, batch 10950, loss[loss=0.2127, simple_loss=0.3132, pruned_loss=0.05612, over 19913.00 frames. ], tot_loss[loss=0.2074, simple_loss=0.2873, pruned_loss=0.06375, over 4231123.38 frames. 
], batch size: 702, lr: 2.96e-03, grad_scale: 8.0 2023-06-27 03:07:48,602 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.904e+02 6.171e+02 9.007e+02 1.291e+03 2.424e+03, threshold=1.801e+03, percent-clipped=14.0 2023-06-27 03:08:38,689 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1712652.0, ans=0.2 2023-06-27 03:08:58,787 INFO [train.py:996] (0/4) Epoch 10, batch 11000, loss[loss=0.2361, simple_loss=0.2971, pruned_loss=0.08754, over 21248.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2862, pruned_loss=0.06513, over 4229126.45 frames. ], batch size: 159, lr: 2.96e-03, grad_scale: 8.0 2023-06-27 03:08:59,855 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1712712.0, ans=0.2 2023-06-27 03:09:21,158 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1712772.0, ans=0.1 2023-06-27 03:09:21,605 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.38 vs. limit=15.0 2023-06-27 03:10:18,745 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.34 vs. limit=10.0 2023-06-27 03:10:46,689 INFO [train.py:996] (0/4) Epoch 10, batch 11050, loss[loss=0.1967, simple_loss=0.2647, pruned_loss=0.06434, over 21850.00 frames. ], tot_loss[loss=0.2085, simple_loss=0.284, pruned_loss=0.06649, over 4246190.47 frames. ], batch size: 98, lr: 2.96e-03, grad_scale: 8.0 2023-06-27 03:11:11,306 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1713072.0, ans=0.0 2023-06-27 03:11:22,037 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.001e+02 5.814e+02 8.503e+02 1.206e+03 2.810e+03, threshold=1.701e+03, percent-clipped=7.0 2023-06-27 03:11:24,908 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.58 vs. limit=10.0 2023-06-27 03:11:40,145 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 03:11:56,834 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1713192.0, ans=0.0 2023-06-27 03:12:10,551 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1713252.0, ans=0.1 2023-06-27 03:12:30,491 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1713252.0, ans=0.125 2023-06-27 03:12:33,229 INFO [train.py:996] (0/4) Epoch 10, batch 11100, loss[loss=0.2046, simple_loss=0.2714, pruned_loss=0.06888, over 21871.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2835, pruned_loss=0.06634, over 4253863.30 frames. ], batch size: 98, lr: 2.96e-03, grad_scale: 8.0 2023-06-27 03:12:59,956 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.56 vs. limit=22.5 2023-06-27 03:13:00,128 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.09 vs. 
limit=12.0 2023-06-27 03:14:22,307 INFO [train.py:996] (0/4) Epoch 10, batch 11150, loss[loss=0.2098, simple_loss=0.3069, pruned_loss=0.05632, over 21738.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.2803, pruned_loss=0.06582, over 4249783.43 frames. ], batch size: 351, lr: 2.96e-03, grad_scale: 8.0 2023-06-27 03:14:37,199 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1713612.0, ans=0.0 2023-06-27 03:14:58,533 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.769e+02 5.768e+02 8.894e+02 1.400e+03 2.503e+03, threshold=1.779e+03, percent-clipped=10.0 2023-06-27 03:15:29,870 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1713792.0, ans=0.0 2023-06-27 03:15:52,064 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1713852.0, ans=0.0 2023-06-27 03:16:08,615 INFO [train.py:996] (0/4) Epoch 10, batch 11200, loss[loss=0.2276, simple_loss=0.2736, pruned_loss=0.09081, over 21288.00 frames. ], tot_loss[loss=0.205, simple_loss=0.2786, pruned_loss=0.06565, over 4250386.96 frames. ], batch size: 507, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 03:16:35,621 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1713972.0, ans=0.2 2023-06-27 03:16:56,833 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1713972.0, ans=0.04949747468305833 2023-06-27 03:17:10,040 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1714032.0, ans=0.0 2023-06-27 03:17:25,732 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1714092.0, ans=0.07 2023-06-27 03:17:31,425 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.47 vs. limit=6.0 2023-06-27 03:17:45,066 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.55 vs. limit=22.5 2023-06-27 03:17:55,857 INFO [train.py:996] (0/4) Epoch 10, batch 11250, loss[loss=0.2499, simple_loss=0.3057, pruned_loss=0.09701, over 21559.00 frames. ], tot_loss[loss=0.2054, simple_loss=0.2781, pruned_loss=0.06637, over 4254209.97 frames. ], batch size: 471, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 03:18:18,910 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1714272.0, ans=0.125 2023-06-27 03:18:26,748 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.073e+02 5.382e+02 8.145e+02 1.130e+03 2.477e+03, threshold=1.629e+03, percent-clipped=7.0 2023-06-27 03:19:38,929 INFO [train.py:996] (0/4) Epoch 10, batch 11300, loss[loss=0.1867, simple_loss=0.2708, pruned_loss=0.05136, over 21872.00 frames. ], tot_loss[loss=0.2061, simple_loss=0.2799, pruned_loss=0.06616, over 4260150.35 frames. 
], batch size: 316, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 03:20:07,990 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1714572.0, ans=0.2 2023-06-27 03:20:27,663 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1714632.0, ans=0.0 2023-06-27 03:20:29,437 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1714632.0, ans=0.125 2023-06-27 03:20:31,201 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1714632.0, ans=0.0 2023-06-27 03:20:53,946 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1714692.0, ans=0.125 2023-06-27 03:21:02,603 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1714692.0, ans=0.125 2023-06-27 03:21:22,955 INFO [train.py:996] (0/4) Epoch 10, batch 11350, loss[loss=0.1742, simple_loss=0.2546, pruned_loss=0.04694, over 21498.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2825, pruned_loss=0.06634, over 4257667.42 frames. ], batch size: 195, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 03:21:39,931 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1714812.0, ans=0.125 2023-06-27 03:21:57,090 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1714872.0, ans=0.125 2023-06-27 03:22:00,029 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.082e+02 5.912e+02 7.672e+02 1.183e+03 2.053e+03, threshold=1.534e+03, percent-clipped=10.0 2023-06-27 03:22:44,941 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1714992.0, ans=0.125 2023-06-27 03:22:48,533 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1715052.0, ans=0.125 2023-06-27 03:23:12,829 INFO [train.py:996] (0/4) Epoch 10, batch 11400, loss[loss=0.2069, simple_loss=0.2931, pruned_loss=0.06034, over 21623.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2866, pruned_loss=0.06797, over 4253839.99 frames. ], batch size: 263, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:23:50,983 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.62 vs. limit=15.0 2023-06-27 03:23:53,808 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 03:24:29,650 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1715292.0, ans=0.0 2023-06-27 03:24:50,066 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.04 vs. limit=12.0 2023-06-27 03:25:07,620 INFO [train.py:996] (0/4) Epoch 10, batch 11450, loss[loss=0.215, simple_loss=0.2931, pruned_loss=0.06844, over 21280.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2869, pruned_loss=0.06651, over 4257533.47 frames. 
], batch size: 176, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:25:33,689 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.931e+02 7.490e+02 1.068e+03 1.427e+03 2.700e+03, threshold=2.136e+03, percent-clipped=19.0 2023-06-27 03:25:41,418 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1715532.0, ans=0.125 2023-06-27 03:25:46,854 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1715532.0, ans=0.125 2023-06-27 03:25:50,049 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1715532.0, ans=0.2 2023-06-27 03:26:03,910 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1715592.0, ans=0.1 2023-06-27 03:26:26,778 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1715652.0, ans=0.0 2023-06-27 03:26:49,443 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1715712.0, ans=0.125 2023-06-27 03:26:50,416 INFO [train.py:996] (0/4) Epoch 10, batch 11500, loss[loss=0.1937, simple_loss=0.2807, pruned_loss=0.05337, over 21245.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.29, pruned_loss=0.06734, over 4253379.41 frames. ], batch size: 159, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:27:01,599 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1715712.0, ans=0.1 2023-06-27 03:27:10,520 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1715712.0, ans=0.07 2023-06-27 03:28:36,597 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1715952.0, ans=0.0 2023-06-27 03:28:45,002 INFO [train.py:996] (0/4) Epoch 10, batch 11550, loss[loss=0.3302, simple_loss=0.4349, pruned_loss=0.1128, over 21664.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.296, pruned_loss=0.06741, over 4261323.52 frames. ], batch size: 441, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:28:56,671 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1716012.0, ans=0.125 2023-06-27 03:29:04,448 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.31 vs. limit=15.0 2023-06-27 03:29:17,158 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.702e+02 7.297e+02 1.033e+03 1.557e+03 3.418e+03, threshold=2.066e+03, percent-clipped=10.0 2023-06-27 03:29:19,483 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1716072.0, ans=0.0 2023-06-27 03:29:30,247 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1716132.0, ans=0.125 2023-06-27 03:29:54,725 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1716192.0, ans=0.125 2023-06-27 03:30:32,992 INFO [train.py:996] (0/4) Epoch 10, batch 11600, loss[loss=0.2494, simple_loss=0.339, pruned_loss=0.07991, over 21241.00 frames. 
], tot_loss[loss=0.2253, simple_loss=0.3108, pruned_loss=0.06991, over 4264725.14 frames. ], batch size: 159, lr: 2.95e-03, grad_scale: 32.0 2023-06-27 03:30:39,513 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.70 vs. limit=12.0 2023-06-27 03:30:45,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1716312.0, ans=0.2 2023-06-27 03:30:46,349 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.83 vs. limit=15.0 2023-06-27 03:30:49,656 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=1716372.0, ans=15.0 2023-06-27 03:32:11,006 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1716552.0, ans=0.125 2023-06-27 03:32:17,885 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1716552.0, ans=0.07 2023-06-27 03:32:20,576 INFO [train.py:996] (0/4) Epoch 10, batch 11650, loss[loss=0.2436, simple_loss=0.3373, pruned_loss=0.07495, over 21295.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3167, pruned_loss=0.07, over 4258934.72 frames. ], batch size: 176, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:32:52,967 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.091e+02 7.350e+02 9.956e+02 1.670e+03 3.528e+03, threshold=1.991e+03, percent-clipped=18.0 2023-06-27 03:33:03,676 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.14 vs. limit=15.0 2023-06-27 03:33:30,854 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1716792.0, ans=0.125 2023-06-27 03:33:31,431 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.59 vs. limit=22.5 2023-06-27 03:33:41,497 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.02 vs. limit=15.0 2023-06-27 03:33:46,326 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.85 vs. limit=15.0 2023-06-27 03:34:04,324 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1716852.0, ans=0.125 2023-06-27 03:34:07,071 INFO [train.py:996] (0/4) Epoch 10, batch 11700, loss[loss=0.1852, simple_loss=0.2502, pruned_loss=0.0601, over 21643.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.3092, pruned_loss=0.06957, over 4255160.39 frames. ], batch size: 248, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:34:56,701 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.40 vs. limit=6.0 2023-06-27 03:34:57,868 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1717032.0, ans=0.0 2023-06-27 03:35:53,413 INFO [train.py:996] (0/4) Epoch 10, batch 11750, loss[loss=0.2161, simple_loss=0.3126, pruned_loss=0.0598, over 19868.00 frames. 
], tot_loss[loss=0.2185, simple_loss=0.2993, pruned_loss=0.06889, over 4250904.82 frames. ], batch size: 702, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:35:57,473 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1717212.0, ans=0.125 2023-06-27 03:36:19,332 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=16.74 vs. limit=15.0 2023-06-27 03:36:26,192 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.050e+02 5.774e+02 7.571e+02 1.065e+03 1.774e+03, threshold=1.514e+03, percent-clipped=0.0 2023-06-27 03:36:43,994 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.14 vs. limit=12.0 2023-06-27 03:36:57,303 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1717332.0, ans=0.125 2023-06-27 03:37:42,104 INFO [train.py:996] (0/4) Epoch 10, batch 11800, loss[loss=0.2244, simple_loss=0.3018, pruned_loss=0.07348, over 21452.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2992, pruned_loss=0.07039, over 4252571.32 frames. ], batch size: 211, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:38:29,962 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1717632.0, ans=0.0 2023-06-27 03:38:38,774 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1717632.0, ans=0.125 2023-06-27 03:39:30,356 INFO [train.py:996] (0/4) Epoch 10, batch 11850, loss[loss=0.2444, simple_loss=0.3321, pruned_loss=0.07842, over 21653.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.3013, pruned_loss=0.06973, over 4261489.43 frames. ], batch size: 441, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:39:31,510 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.49 vs. limit=22.5 2023-06-27 03:40:09,304 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.078e+02 6.779e+02 9.644e+02 1.423e+03 2.292e+03, threshold=1.929e+03, percent-clipped=21.0 2023-06-27 03:40:28,135 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1717932.0, ans=0.2 2023-06-27 03:40:48,134 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.72 vs. limit=22.5 2023-06-27 03:40:57,977 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1717992.0, ans=0.0 2023-06-27 03:41:16,564 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1718052.0, ans=0.0 2023-06-27 03:41:25,966 INFO [train.py:996] (0/4) Epoch 10, batch 11900, loss[loss=0.1772, simple_loss=0.2544, pruned_loss=0.05001, over 21327.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.302, pruned_loss=0.06724, over 4259169.00 frames. 
], batch size: 131, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:41:48,357 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1718172.0, ans=0.1 2023-06-27 03:42:21,992 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.26 vs. limit=22.5 2023-06-27 03:42:30,363 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1718232.0, ans=0.1 2023-06-27 03:42:39,246 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1718292.0, ans=0.0 2023-06-27 03:42:51,524 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1718352.0, ans=10.0 2023-06-27 03:43:15,225 INFO [train.py:996] (0/4) Epoch 10, batch 11950, loss[loss=0.1847, simple_loss=0.2845, pruned_loss=0.04245, over 21741.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.3025, pruned_loss=0.065, over 4256908.75 frames. ], batch size: 351, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:43:30,382 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1718412.0, ans=0.125 2023-06-27 03:43:53,620 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.803e+02 5.577e+02 8.393e+02 1.338e+03 3.088e+03, threshold=1.679e+03, percent-clipped=11.0 2023-06-27 03:44:11,759 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1718532.0, ans=0.125 2023-06-27 03:45:09,401 INFO [train.py:996] (0/4) Epoch 10, batch 12000, loss[loss=0.1814, simple_loss=0.2444, pruned_loss=0.05913, over 21286.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2958, pruned_loss=0.06334, over 4261848.64 frames. ], batch size: 551, lr: 2.95e-03, grad_scale: 32.0 2023-06-27 03:45:09,402 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-27 03:45:19,532 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.2.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.8698, 4.2020, 4.0488, 4.2395], device='cuda:0') 2023-06-27 03:45:30,594 INFO [train.py:1028] (0/4) Epoch 10, validation: loss=0.2595, simple_loss=0.3509, pruned_loss=0.08412, over 1796401.00 frames. 2023-06-27 03:45:30,595 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23714MB 2023-06-27 03:46:22,631 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1718832.0, ans=0.0 2023-06-27 03:46:24,475 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1718832.0, ans=0.1 2023-06-27 03:46:31,322 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1718892.0, ans=0.0 2023-06-27 03:47:18,643 INFO [train.py:996] (0/4) Epoch 10, batch 12050, loss[loss=0.2079, simple_loss=0.2689, pruned_loss=0.07348, over 21596.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2921, pruned_loss=0.065, over 4265961.13 frames. 
], batch size: 548, lr: 2.95e-03, grad_scale: 32.0 2023-06-27 03:47:53,493 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.219e+02 6.182e+02 8.249e+02 1.335e+03 3.065e+03, threshold=1.650e+03, percent-clipped=10.0 2023-06-27 03:48:17,486 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.91 vs. limit=15.0 2023-06-27 03:48:27,607 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1719192.0, ans=0.05 2023-06-27 03:48:52,659 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1719252.0, ans=0.125 2023-06-27 03:49:08,220 INFO [train.py:996] (0/4) Epoch 10, batch 12100, loss[loss=0.237, simple_loss=0.3215, pruned_loss=0.07628, over 21637.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2962, pruned_loss=0.06801, over 4270897.89 frames. ], batch size: 389, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:49:45,628 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.74 vs. limit=10.0 2023-06-27 03:50:07,751 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.20 vs. limit=22.5 2023-06-27 03:50:31,039 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1719492.0, ans=0.125 2023-06-27 03:50:34,505 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1719492.0, ans=0.125 2023-06-27 03:50:59,287 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1719552.0, ans=0.125 2023-06-27 03:51:06,036 INFO [train.py:996] (0/4) Epoch 10, batch 12150, loss[loss=0.2068, simple_loss=0.2934, pruned_loss=0.06015, over 21839.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.3005, pruned_loss=0.06802, over 4274728.83 frames. ], batch size: 316, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:51:10,283 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1719612.0, ans=0.125 2023-06-27 03:51:12,178 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1719612.0, ans=0.1 2023-06-27 03:51:40,996 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.275e+02 6.507e+02 9.290e+02 1.300e+03 3.036e+03, threshold=1.858e+03, percent-clipped=15.0 2023-06-27 03:51:56,496 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.96 vs. limit=15.0 2023-06-27 03:52:02,637 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 03:52:53,538 INFO [train.py:996] (0/4) Epoch 10, batch 12200, loss[loss=0.2327, simple_loss=0.2758, pruned_loss=0.09477, over 21331.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2973, pruned_loss=0.06649, over 4266893.34 frames. 
], batch size: 508, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:53:17,011 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1719972.0, ans=0.125 2023-06-27 03:53:28,746 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1719972.0, ans=0.09899494936611666 2023-06-27 03:53:56,035 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 03:54:29,450 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1720152.0, ans=0.0 2023-06-27 03:54:40,552 INFO [train.py:996] (0/4) Epoch 10, batch 12250, loss[loss=0.1859, simple_loss=0.2426, pruned_loss=0.06456, over 20791.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.2891, pruned_loss=0.06412, over 4263730.68 frames. ], batch size: 608, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:54:45,690 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1720212.0, ans=0.1 2023-06-27 03:55:08,244 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=1720272.0, ans=0.05 2023-06-27 03:55:14,851 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.738e+02 5.320e+02 7.726e+02 1.159e+03 2.410e+03, threshold=1.545e+03, percent-clipped=8.0 2023-06-27 03:56:02,164 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.43 vs. limit=22.5 2023-06-27 03:56:05,051 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 03:56:28,169 INFO [train.py:996] (0/4) Epoch 10, batch 12300, loss[loss=0.2091, simple_loss=0.3045, pruned_loss=0.05681, over 21841.00 frames. ], tot_loss[loss=0.202, simple_loss=0.2836, pruned_loss=0.06015, over 4248167.06 frames. ], batch size: 316, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:56:42,807 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1720512.0, ans=0.2 2023-06-27 03:57:05,270 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1720572.0, ans=0.125 2023-06-27 03:58:16,035 INFO [train.py:996] (0/4) Epoch 10, batch 12350, loss[loss=0.2307, simple_loss=0.3043, pruned_loss=0.07851, over 21523.00 frames. ], tot_loss[loss=0.2056, simple_loss=0.2883, pruned_loss=0.06144, over 4254779.92 frames. ], batch size: 548, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:58:50,881 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.592e+02 6.371e+02 1.042e+03 1.645e+03 3.511e+03, threshold=2.083e+03, percent-clipped=28.0 2023-06-27 04:00:04,503 INFO [train.py:996] (0/4) Epoch 10, batch 12400, loss[loss=0.1971, simple_loss=0.3207, pruned_loss=0.03678, over 19894.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2925, pruned_loss=0.06414, over 4260111.82 frames. 
], batch size: 703, lr: 2.95e-03, grad_scale: 32.0 2023-06-27 04:00:45,681 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1721172.0, ans=0.05 2023-06-27 04:01:08,541 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1721232.0, ans=0.0 2023-06-27 04:01:13,890 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.18 vs. limit=22.5 2023-06-27 04:01:17,243 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1721292.0, ans=0.125 2023-06-27 04:01:43,293 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1721352.0, ans=0.125 2023-06-27 04:01:58,702 INFO [train.py:996] (0/4) Epoch 10, batch 12450, loss[loss=0.2515, simple_loss=0.3287, pruned_loss=0.08718, over 21610.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2951, pruned_loss=0.0668, over 4267144.06 frames. ], batch size: 389, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 04:02:10,524 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.44 vs. limit=22.5 2023-06-27 04:02:36,088 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.371e+02 6.019e+02 7.668e+02 9.401e+02 2.639e+03, threshold=1.534e+03, percent-clipped=2.0 2023-06-27 04:03:25,093 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.76 vs. limit=15.0 2023-06-27 04:03:47,753 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1721712.0, ans=0.125 2023-06-27 04:03:48,673 INFO [train.py:996] (0/4) Epoch 10, batch 12500, loss[loss=0.2685, simple_loss=0.3602, pruned_loss=0.08837, over 21437.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.3025, pruned_loss=0.06956, over 4269391.80 frames. ], batch size: 211, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 04:03:51,620 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.75 vs. limit=22.5 2023-06-27 04:04:09,962 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1721772.0, ans=0.2 2023-06-27 04:04:11,953 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1721772.0, ans=0.0 2023-06-27 04:05:03,547 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1721892.0, ans=0.1 2023-06-27 04:05:45,546 INFO [train.py:996] (0/4) Epoch 10, batch 12550, loss[loss=0.2327, simple_loss=0.3231, pruned_loss=0.07113, over 21422.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.3084, pruned_loss=0.07153, over 4265865.30 frames. ], batch size: 131, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 04:05:57,537 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=6.17 vs. 
limit=22.5 2023-06-27 04:06:27,276 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.271e+02 6.681e+02 8.893e+02 1.594e+03 3.232e+03, threshold=1.779e+03, percent-clipped=27.0 2023-06-27 04:06:28,138 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1722072.0, ans=0.125 2023-06-27 04:06:34,313 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.91 vs. limit=6.0 2023-06-27 04:07:11,953 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.90 vs. limit=22.5 2023-06-27 04:07:39,578 INFO [train.py:996] (0/4) Epoch 10, batch 12600, loss[loss=0.19, simple_loss=0.2795, pruned_loss=0.05026, over 21771.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.3066, pruned_loss=0.07063, over 4264351.54 frames. ], batch size: 282, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 04:08:15,037 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.70 vs. limit=22.5 2023-06-27 04:09:12,158 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.35 vs. limit=22.5 2023-06-27 04:09:20,827 INFO [train.py:996] (0/4) Epoch 10, batch 12650, loss[loss=0.2118, simple_loss=0.2838, pruned_loss=0.06988, over 21829.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2981, pruned_loss=0.06715, over 4272639.82 frames. ], batch size: 371, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 04:09:34,112 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.38 vs. limit=15.0 2023-06-27 04:09:47,484 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1722672.0, ans=0.2 2023-06-27 04:10:02,122 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.614e+02 6.359e+02 1.024e+03 1.411e+03 2.503e+03, threshold=2.048e+03, percent-clipped=9.0 2023-06-27 04:10:06,118 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1722732.0, ans=0.125 2023-06-27 04:11:14,827 INFO [train.py:996] (0/4) Epoch 10, batch 12700, loss[loss=0.2445, simple_loss=0.3189, pruned_loss=0.08503, over 21621.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2973, pruned_loss=0.06881, over 4280687.27 frames. ], batch size: 389, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 04:11:45,150 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1722972.0, ans=0.1 2023-06-27 04:12:18,006 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1723092.0, ans=0.0 2023-06-27 04:13:06,384 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.36 vs. limit=15.0 2023-06-27 04:13:08,227 INFO [train.py:996] (0/4) Epoch 10, batch 12750, loss[loss=0.188, simple_loss=0.28, pruned_loss=0.04794, over 21703.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2976, pruned_loss=0.06874, over 4273982.06 frames. 
], batch size: 247, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 04:13:13,955 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1723212.0, ans=0.1 2023-06-27 04:13:37,766 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1723272.0, ans=0.1 2023-06-27 04:13:38,771 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.282e+02 6.128e+02 7.827e+02 1.074e+03 2.616e+03, threshold=1.565e+03, percent-clipped=3.0 2023-06-27 04:14:54,511 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1723512.0, ans=0.0 2023-06-27 04:14:55,470 INFO [train.py:996] (0/4) Epoch 10, batch 12800, loss[loss=0.1944, simple_loss=0.2715, pruned_loss=0.05863, over 21648.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2978, pruned_loss=0.06941, over 4277902.96 frames. ], batch size: 263, lr: 2.95e-03, grad_scale: 32.0 2023-06-27 04:15:44,071 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.44 vs. limit=22.5 2023-06-27 04:16:15,984 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1723692.0, ans=0.025 2023-06-27 04:16:45,031 INFO [train.py:996] (0/4) Epoch 10, batch 12850, loss[loss=0.1885, simple_loss=0.2763, pruned_loss=0.05042, over 21434.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.3008, pruned_loss=0.07142, over 4277716.47 frames. ], batch size: 211, lr: 2.95e-03, grad_scale: 32.0 2023-06-27 04:16:58,407 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 04:17:22,027 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.972e+02 5.917e+02 7.824e+02 1.083e+03 2.191e+03, threshold=1.565e+03, percent-clipped=6.0 2023-06-27 04:17:26,816 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.40 vs. limit=10.0 2023-06-27 04:17:29,988 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1723932.0, ans=0.0 2023-06-27 04:18:05,074 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1723992.0, ans=0.1 2023-06-27 04:18:08,450 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1723992.0, ans=0.0 2023-06-27 04:18:23,188 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.38 vs. limit=15.0 2023-06-27 04:18:26,313 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1724052.0, ans=0.125 2023-06-27 04:18:34,528 INFO [train.py:996] (0/4) Epoch 10, batch 12900, loss[loss=0.2085, simple_loss=0.3136, pruned_loss=0.05169, over 21246.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2968, pruned_loss=0.06709, over 4283170.48 frames. 
], batch size: 548, lr: 2.95e-03, grad_scale: 32.0 2023-06-27 04:19:46,180 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1724232.0, ans=0.125 2023-06-27 04:20:01,509 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1724292.0, ans=0.0 2023-06-27 04:20:06,316 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1724352.0, ans=0.015 2023-06-27 04:20:08,070 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1724352.0, ans=0.125 2023-06-27 04:20:09,861 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1724352.0, ans=0.04949747468305833 2023-06-27 04:20:23,528 INFO [train.py:996] (0/4) Epoch 10, batch 12950, loss[loss=0.2049, simple_loss=0.2895, pruned_loss=0.06012, over 21792.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2971, pruned_loss=0.0665, over 4282118.90 frames. ], batch size: 282, lr: 2.95e-03, grad_scale: 32.0 2023-06-27 04:21:19,214 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.055e+02 6.814e+02 9.301e+02 1.537e+03 3.645e+03, threshold=1.860e+03, percent-clipped=23.0 2023-06-27 04:21:40,911 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.98 vs. limit=12.0 2023-06-27 04:21:48,972 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1724592.0, ans=0.1 2023-06-27 04:21:52,647 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1724592.0, ans=0.0 2023-06-27 04:22:09,617 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1724652.0, ans=0.0 2023-06-27 04:22:17,943 INFO [train.py:996] (0/4) Epoch 10, batch 13000, loss[loss=0.2512, simple_loss=0.3226, pruned_loss=0.08989, over 21421.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2972, pruned_loss=0.06731, over 4281958.29 frames. ], batch size: 507, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 04:22:21,888 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1724712.0, ans=0.025 2023-06-27 04:22:27,390 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.16 vs. limit=12.0 2023-06-27 04:22:50,939 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1724772.0, ans=0.0 2023-06-27 04:23:21,157 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1724832.0, ans=0.125 2023-06-27 04:24:05,880 INFO [train.py:996] (0/4) Epoch 10, batch 13050, loss[loss=0.1927, simple_loss=0.2671, pruned_loss=0.05915, over 21655.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2924, pruned_loss=0.06506, over 4279057.09 frames. 
], batch size: 230, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 04:24:21,644 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1725072.0, ans=0.0 2023-06-27 04:24:46,567 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1725072.0, ans=0.1 2023-06-27 04:24:49,086 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.711e+02 5.473e+02 7.954e+02 1.041e+03 2.275e+03, threshold=1.591e+03, percent-clipped=1.0 2023-06-27 04:25:20,787 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1725192.0, ans=0.125 2023-06-27 04:25:22,467 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1725192.0, ans=0.125 2023-06-27 04:25:26,542 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.58 vs. limit=15.0 2023-06-27 04:25:53,812 INFO [train.py:996] (0/4) Epoch 10, batch 13100, loss[loss=0.2171, simple_loss=0.3052, pruned_loss=0.06445, over 21612.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2945, pruned_loss=0.06526, over 4281239.43 frames. ], batch size: 389, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 04:26:25,786 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1725372.0, ans=0.1 2023-06-27 04:27:02,617 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1725492.0, ans=0.2 2023-06-27 04:27:02,746 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1725492.0, ans=0.0 2023-06-27 04:27:43,054 INFO [train.py:996] (0/4) Epoch 10, batch 13150, loss[loss=0.1814, simple_loss=0.2608, pruned_loss=0.05101, over 21746.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2995, pruned_loss=0.0665, over 4270530.40 frames. ], batch size: 282, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 04:28:07,428 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1725612.0, ans=0.125 2023-06-27 04:28:15,851 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1725672.0, ans=0.125 2023-06-27 04:28:24,530 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1725672.0, ans=0.125 2023-06-27 04:28:32,058 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.070e+02 6.134e+02 8.116e+02 1.164e+03 2.711e+03, threshold=1.623e+03, percent-clipped=9.0 2023-06-27 04:28:32,722 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 04:28:39,841 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1725732.0, ans=0.0 2023-06-27 04:28:39,865 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1725732.0, ans=0.125 2023-06-27 04:28:52,801 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.09 vs. 
limit=15.0 2023-06-27 04:29:03,064 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.53 vs. limit=12.0 2023-06-27 04:29:37,433 INFO [train.py:996] (0/4) Epoch 10, batch 13200, loss[loss=0.2189, simple_loss=0.298, pruned_loss=0.0699, over 21746.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2985, pruned_loss=0.06687, over 4270403.09 frames. ], batch size: 332, lr: 2.95e-03, grad_scale: 32.0 2023-06-27 04:29:40,250 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.96 vs. limit=12.0 2023-06-27 04:29:58,799 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1725972.0, ans=0.0 2023-06-27 04:30:00,479 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1725972.0, ans=0.0 2023-06-27 04:30:20,143 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1726032.0, ans=0.07 2023-06-27 04:30:48,074 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1726092.0, ans=0.125 2023-06-27 04:30:58,706 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 04:31:19,050 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=1726152.0, ans=15.0 2023-06-27 04:31:26,746 INFO [train.py:996] (0/4) Epoch 10, batch 13250, loss[loss=0.2228, simple_loss=0.3043, pruned_loss=0.07064, over 21670.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2971, pruned_loss=0.0686, over 4269909.02 frames. 
], batch size: 389, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 04:31:27,488 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1726212.0, ans=0.125 2023-06-27 04:31:44,791 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1726212.0, ans=0.125 2023-06-27 04:31:46,313 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1726212.0, ans=0.1 2023-06-27 04:31:59,964 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1726272.0, ans=0.2 2023-06-27 04:32:06,248 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.999e+02 7.655e+02 1.062e+03 1.668e+03 3.650e+03, threshold=2.123e+03, percent-clipped=27.0 2023-06-27 04:32:24,541 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1726332.0, ans=0.1 2023-06-27 04:32:28,475 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1726392.0, ans=0.125 2023-06-27 04:32:33,760 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1726392.0, ans=0.125 2023-06-27 04:32:53,873 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1726392.0, ans=0.125 2023-06-27 04:33:21,194 INFO [train.py:996] (0/4) Epoch 10, batch 13300, loss[loss=0.2341, simple_loss=0.318, pruned_loss=0.07505, over 21921.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2994, pruned_loss=0.06855, over 4274574.33 frames. ], batch size: 316, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 04:33:30,765 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1726512.0, ans=0.0 2023-06-27 04:33:37,373 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1726572.0, ans=0.125 2023-06-27 04:33:47,919 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 04:33:49,466 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1726572.0, ans=0.125 2023-06-27 04:34:12,565 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1726632.0, ans=0.125 2023-06-27 04:34:58,909 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1726752.0, ans=0.125 2023-06-27 04:35:10,295 INFO [train.py:996] (0/4) Epoch 10, batch 13350, loss[loss=0.2238, simple_loss=0.2952, pruned_loss=0.07617, over 20652.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.3037, pruned_loss=0.07101, over 4273079.14 frames. 
], batch size: 608, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 04:35:22,762 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 04:35:43,141 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1726872.0, ans=0.1 2023-06-27 04:35:48,974 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.150e+02 5.865e+02 7.490e+02 1.135e+03 2.182e+03, threshold=1.498e+03, percent-clipped=1.0 2023-06-27 04:36:55,672 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1727052.0, ans=0.125 2023-06-27 04:36:58,410 INFO [train.py:996] (0/4) Epoch 10, batch 13400, loss[loss=0.202, simple_loss=0.2787, pruned_loss=0.06262, over 21468.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.3043, pruned_loss=0.07233, over 4274470.30 frames. ], batch size: 131, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 04:37:45,581 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.24 vs. limit=6.0 2023-06-27 04:38:08,381 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1727292.0, ans=0.0 2023-06-27 04:38:34,361 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 04:38:47,931 INFO [train.py:996] (0/4) Epoch 10, batch 13450, loss[loss=0.2186, simple_loss=0.2885, pruned_loss=0.07434, over 21158.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3055, pruned_loss=0.07465, over 4282406.87 frames. ], batch size: 143, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 04:39:06,115 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1727412.0, ans=0.125 2023-06-27 04:39:39,485 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.261e+02 5.946e+02 7.827e+02 1.298e+03 2.826e+03, threshold=1.565e+03, percent-clipped=16.0 2023-06-27 04:39:40,268 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1727532.0, ans=0.05 2023-06-27 04:39:45,690 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1727532.0, ans=0.035 2023-06-27 04:39:49,787 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.72 vs. limit=15.0 2023-06-27 04:40:37,092 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1727652.0, ans=0.0 2023-06-27 04:40:43,702 INFO [train.py:996] (0/4) Epoch 10, batch 13500, loss[loss=0.1541, simple_loss=0.2191, pruned_loss=0.04455, over 21341.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.297, pruned_loss=0.07146, over 4279711.60 frames. ], batch size: 176, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 04:40:51,842 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.57 vs. 
limit=15.0 2023-06-27 04:40:54,621 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1727712.0, ans=0.1 2023-06-27 04:41:02,395 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1727712.0, ans=0.2 2023-06-27 04:41:54,543 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1727892.0, ans=0.125 2023-06-27 04:42:00,438 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.60 vs. limit=15.0 2023-06-27 04:42:28,922 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-288000.pt 2023-06-27 04:42:35,502 INFO [train.py:996] (0/4) Epoch 10, batch 13550, loss[loss=0.2324, simple_loss=0.3365, pruned_loss=0.06416, over 21739.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.3006, pruned_loss=0.07151, over 4274963.37 frames. ], batch size: 298, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 04:43:25,547 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.534e+02 7.345e+02 1.395e+03 2.191e+03 3.934e+03, threshold=2.790e+03, percent-clipped=45.0 2023-06-27 04:43:55,311 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1728192.0, ans=0.125 2023-06-27 04:43:56,976 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1728192.0, ans=0.125 2023-06-27 04:44:12,030 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1728252.0, ans=0.125 2023-06-27 04:44:21,781 INFO [train.py:996] (0/4) Epoch 10, batch 13600, loss[loss=0.196, simple_loss=0.274, pruned_loss=0.05894, over 21586.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.3007, pruned_loss=0.07186, over 4285067.92 frames. ], batch size: 263, lr: 2.94e-03, grad_scale: 32.0 2023-06-27 04:45:30,343 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.06 vs. limit=15.0 2023-06-27 04:45:57,464 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1728552.0, ans=0.2 2023-06-27 04:45:57,910 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.99 vs. limit=22.5 2023-06-27 04:46:00,651 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1728552.0, ans=0.0 2023-06-27 04:46:13,945 INFO [train.py:996] (0/4) Epoch 10, batch 13650, loss[loss=0.1761, simple_loss=0.2502, pruned_loss=0.051, over 21627.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.295, pruned_loss=0.06895, over 4272651.86 frames. 
], batch size: 247, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 04:46:18,283 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1728612.0, ans=0.0 2023-06-27 04:46:40,490 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1728672.0, ans=0.2 2023-06-27 04:46:59,893 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.764e+02 5.044e+02 6.157e+02 8.736e+02 2.830e+03, threshold=1.231e+03, percent-clipped=2.0 2023-06-27 04:47:48,717 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1728852.0, ans=0.1 2023-06-27 04:48:02,143 INFO [train.py:996] (0/4) Epoch 10, batch 13700, loss[loss=0.1788, simple_loss=0.2446, pruned_loss=0.0565, over 21391.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.289, pruned_loss=0.06821, over 4272360.56 frames. ], batch size: 131, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 04:48:02,859 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1728912.0, ans=0.125 2023-06-27 04:48:06,748 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.47 vs. limit=22.5 2023-06-27 04:48:51,754 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1729032.0, ans=10.0 2023-06-27 04:49:18,106 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1729092.0, ans=0.1 2023-06-27 04:49:23,342 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1729092.0, ans=0.0 2023-06-27 04:49:50,692 INFO [train.py:996] (0/4) Epoch 10, batch 13750, loss[loss=0.1839, simple_loss=0.2585, pruned_loss=0.05459, over 21295.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.2881, pruned_loss=0.0677, over 4268516.98 frames. ], batch size: 159, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 04:50:29,040 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.68 vs. limit=15.0 2023-06-27 04:50:44,256 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.944e+02 7.619e+02 1.226e+03 1.767e+03 3.252e+03, threshold=2.451e+03, percent-clipped=47.0 2023-06-27 04:50:46,721 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1729332.0, ans=0.125 2023-06-27 04:51:00,310 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.30 vs. limit=22.5 2023-06-27 04:51:34,194 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1729452.0, ans=0.0 2023-06-27 04:51:52,081 INFO [train.py:996] (0/4) Epoch 10, batch 13800, loss[loss=0.2595, simple_loss=0.3723, pruned_loss=0.07337, over 19773.00 frames. ], tot_loss[loss=0.214, simple_loss=0.294, pruned_loss=0.06697, over 4262389.41 frames. ], batch size: 703, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 04:53:40,115 INFO [train.py:996] (0/4) Epoch 10, batch 13850, loss[loss=0.262, simple_loss=0.3444, pruned_loss=0.08982, over 21735.00 frames. 
], tot_loss[loss=0.2184, simple_loss=0.3005, pruned_loss=0.06813, over 4264573.08 frames. ], batch size: 351, lr: 2.94e-03, grad_scale: 8.0 2023-06-27 04:53:40,696 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1729812.0, ans=0.2 2023-06-27 04:53:44,297 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1729812.0, ans=0.125 2023-06-27 04:54:23,609 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.256e+02 7.886e+02 1.223e+03 1.813e+03 4.044e+03, threshold=2.445e+03, percent-clipped=7.0 2023-06-27 04:54:39,656 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1729932.0, ans=0.0 2023-06-27 04:55:28,104 INFO [train.py:996] (0/4) Epoch 10, batch 13900, loss[loss=0.2355, simple_loss=0.3096, pruned_loss=0.08066, over 21264.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.3042, pruned_loss=0.07038, over 4267178.07 frames. ], batch size: 143, lr: 2.94e-03, grad_scale: 8.0 2023-06-27 04:55:53,145 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.26 vs. limit=15.0 2023-06-27 04:56:09,583 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1730232.0, ans=0.125 2023-06-27 04:57:14,356 INFO [train.py:996] (0/4) Epoch 10, batch 13950, loss[loss=0.2611, simple_loss=0.3399, pruned_loss=0.09118, over 21876.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3041, pruned_loss=0.07233, over 4277802.18 frames. ], batch size: 107, lr: 2.94e-03, grad_scale: 8.0 2023-06-27 04:57:29,300 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1730412.0, ans=0.1 2023-06-27 04:57:42,734 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1730472.0, ans=0.125 2023-06-27 04:58:02,047 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.602e+02 6.616e+02 8.570e+02 1.217e+03 2.156e+03, threshold=1.714e+03, percent-clipped=0.0 2023-06-27 04:58:33,184 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 04:58:59,362 INFO [train.py:996] (0/4) Epoch 10, batch 14000, loss[loss=0.1559, simple_loss=0.2295, pruned_loss=0.04113, over 21383.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.3003, pruned_loss=0.0703, over 4271487.57 frames. ], batch size: 160, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 04:59:06,893 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1730712.0, ans=0.125 2023-06-27 05:00:33,468 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1730952.0, ans=0.1 2023-06-27 05:00:40,359 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1730952.0, ans=0.04949747468305833 2023-06-27 05:00:51,602 INFO [train.py:996] (0/4) Epoch 10, batch 14050, loss[loss=0.1971, simple_loss=0.2658, pruned_loss=0.06414, over 21768.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2956, pruned_loss=0.06683, over 4280376.26 frames. 
], batch size: 351, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:01:24,965 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.98 vs. limit=15.0 2023-06-27 05:01:33,549 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.649e+02 7.272e+02 1.104e+03 1.609e+03 3.327e+03, threshold=2.207e+03, percent-clipped=18.0 2023-06-27 05:01:58,833 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=1731192.0, ans=0.5 2023-06-27 05:02:27,190 INFO [train.py:996] (0/4) Epoch 10, batch 14100, loss[loss=0.2071, simple_loss=0.319, pruned_loss=0.04753, over 19819.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2904, pruned_loss=0.06637, over 4264900.92 frames. ], batch size: 704, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:02:43,402 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1731312.0, ans=0.1 2023-06-27 05:02:48,606 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1731372.0, ans=0.1 2023-06-27 05:03:04,266 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1731372.0, ans=0.07 2023-06-27 05:03:19,396 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1731432.0, ans=0.0 2023-06-27 05:03:26,128 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1731432.0, ans=0.0 2023-06-27 05:03:43,756 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1731492.0, ans=0.0 2023-06-27 05:03:52,910 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.30 vs. limit=22.5 2023-06-27 05:04:12,993 INFO [train.py:996] (0/4) Epoch 10, batch 14150, loss[loss=0.2134, simple_loss=0.3022, pruned_loss=0.0623, over 21658.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2926, pruned_loss=0.06707, over 4264797.29 frames. ], batch size: 230, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:04:18,585 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1731612.0, ans=0.0 2023-06-27 05:04:59,064 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.088e+02 7.057e+02 1.107e+03 1.740e+03 3.584e+03, threshold=2.215e+03, percent-clipped=8.0 2023-06-27 05:04:59,665 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1731732.0, ans=0.2 2023-06-27 05:05:06,941 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.45 vs. limit=10.0 2023-06-27 05:05:15,070 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.17 vs. 
limit=10.0 2023-06-27 05:05:27,226 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1731792.0, ans=0.0 2023-06-27 05:05:48,282 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1731852.0, ans=0.09899494936611666 2023-06-27 05:05:55,687 INFO [train.py:996] (0/4) Epoch 10, batch 14200, loss[loss=0.2166, simple_loss=0.2756, pruned_loss=0.07873, over 21776.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2922, pruned_loss=0.06665, over 4269725.30 frames. ], batch size: 371, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:06:04,713 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 05:06:11,554 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1731972.0, ans=0.125 2023-06-27 05:06:54,038 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1732032.0, ans=0.125 2023-06-27 05:07:41,165 INFO [train.py:996] (0/4) Epoch 10, batch 14250, loss[loss=0.1772, simple_loss=0.2659, pruned_loss=0.04423, over 21662.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.2873, pruned_loss=0.06582, over 4267238.13 frames. ], batch size: 415, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:07:54,657 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1732212.0, ans=0.125 2023-06-27 05:08:15,453 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1732272.0, ans=0.07 2023-06-27 05:08:29,677 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1732332.0, ans=0.0 2023-06-27 05:08:32,538 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.770e+02 5.743e+02 8.448e+02 1.114e+03 2.445e+03, threshold=1.690e+03, percent-clipped=1.0 2023-06-27 05:08:38,568 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1732332.0, ans=0.0 2023-06-27 05:09:25,895 INFO [train.py:996] (0/4) Epoch 10, batch 14300, loss[loss=0.2407, simple_loss=0.3358, pruned_loss=0.07276, over 21727.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.2879, pruned_loss=0.06478, over 4265599.54 frames. ], batch size: 332, lr: 2.94e-03, grad_scale: 8.0 2023-06-27 05:10:05,285 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1732572.0, ans=0.125 2023-06-27 05:10:09,361 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.13 vs. limit=15.0 2023-06-27 05:10:18,029 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.99 vs. limit=6.0 2023-06-27 05:10:45,526 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1732692.0, ans=0.125 2023-06-27 05:11:14,218 INFO [train.py:996] (0/4) Epoch 10, batch 14350, loss[loss=0.2355, simple_loss=0.3243, pruned_loss=0.07336, over 21559.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2927, pruned_loss=0.06634, over 4259849.84 frames. 
], batch size: 471, lr: 2.94e-03, grad_scale: 8.0 2023-06-27 05:11:15,020 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1732812.0, ans=0.0 2023-06-27 05:11:23,882 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1732812.0, ans=0.125 2023-06-27 05:11:25,383 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1732812.0, ans=0.0 2023-06-27 05:11:35,595 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1732872.0, ans=0.125 2023-06-27 05:11:40,496 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1732872.0, ans=0.125 2023-06-27 05:12:00,633 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.66 vs. limit=15.0 2023-06-27 05:12:04,241 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.25 vs. limit=10.0 2023-06-27 05:12:04,599 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.018e+02 7.754e+02 1.154e+03 1.779e+03 3.670e+03, threshold=2.308e+03, percent-clipped=30.0 2023-06-27 05:12:08,650 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1732932.0, ans=0.125 2023-06-27 05:13:00,602 INFO [train.py:996] (0/4) Epoch 10, batch 14400, loss[loss=0.1993, simple_loss=0.2788, pruned_loss=0.05992, over 16494.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.291, pruned_loss=0.06671, over 4262567.58 frames. ], batch size: 60, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:13:06,169 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1733112.0, ans=0.0 2023-06-27 05:13:35,119 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1733172.0, ans=0.125 2023-06-27 05:14:14,539 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1733292.0, ans=0.125 2023-06-27 05:14:15,003 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.17 vs. limit=6.0 2023-06-27 05:14:31,875 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1733352.0, ans=0.0 2023-06-27 05:14:37,072 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1733352.0, ans=0.2 2023-06-27 05:14:37,583 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.37 vs. limit=15.0 2023-06-27 05:14:46,478 INFO [train.py:996] (0/4) Epoch 10, batch 14450, loss[loss=0.1838, simple_loss=0.25, pruned_loss=0.05879, over 21609.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.2866, pruned_loss=0.06623, over 4256372.54 frames. 
], batch size: 298, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:14:51,410 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.96 vs. limit=15.0 2023-06-27 05:15:20,073 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.66 vs. limit=15.0 2023-06-27 05:15:36,482 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.984e+02 5.618e+02 7.352e+02 1.088e+03 2.382e+03, threshold=1.470e+03, percent-clipped=1.0 2023-06-27 05:16:17,954 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1733652.0, ans=0.125 2023-06-27 05:16:18,595 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.32 vs. limit=15.0 2023-06-27 05:16:18,731 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.20 vs. limit=22.5 2023-06-27 05:16:28,067 INFO [train.py:996] (0/4) Epoch 10, batch 14500, loss[loss=0.1983, simple_loss=0.2773, pruned_loss=0.05964, over 21747.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2834, pruned_loss=0.06609, over 4262859.49 frames. ], batch size: 98, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:16:56,786 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1733772.0, ans=0.2 2023-06-27 05:17:00,337 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1733772.0, ans=0.125 2023-06-27 05:17:35,675 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.03 vs. limit=15.0 2023-06-27 05:17:36,804 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1733892.0, ans=0.125 2023-06-27 05:17:58,460 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1733952.0, ans=0.2 2023-06-27 05:18:00,877 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.20 vs. limit=22.5 2023-06-27 05:18:12,027 INFO [train.py:996] (0/4) Epoch 10, batch 14550, loss[loss=0.2846, simple_loss=0.3599, pruned_loss=0.1047, over 21238.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2864, pruned_loss=0.06719, over 4263046.53 frames. ], batch size: 143, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:18:14,285 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1734012.0, ans=0.1 2023-06-27 05:18:24,859 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1734012.0, ans=0.125 2023-06-27 05:18:36,010 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.15 vs. 
limit=15.0 2023-06-27 05:19:02,694 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.284e+02 5.674e+02 7.709e+02 1.144e+03 2.600e+03, threshold=1.542e+03, percent-clipped=15.0 2023-06-27 05:20:05,608 INFO [train.py:996] (0/4) Epoch 10, batch 14600, loss[loss=0.2294, simple_loss=0.319, pruned_loss=0.06996, over 21338.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2961, pruned_loss=0.07147, over 4266017.61 frames. ], batch size: 176, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:20:35,607 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.85 vs. limit=6.0 2023-06-27 05:20:45,887 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1734432.0, ans=0.0 2023-06-27 05:20:58,013 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1734432.0, ans=0.125 2023-06-27 05:21:31,176 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1734552.0, ans=0.125 2023-06-27 05:21:48,289 INFO [train.py:996] (0/4) Epoch 10, batch 14650, loss[loss=0.1984, simple_loss=0.3018, pruned_loss=0.04755, over 19797.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2983, pruned_loss=0.07108, over 4251991.85 frames. ], batch size: 702, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:22:26,594 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1734672.0, ans=0.0 2023-06-27 05:22:34,053 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=1734732.0, ans=15.0 2023-06-27 05:22:36,801 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1734732.0, ans=0.125 2023-06-27 05:22:39,586 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 2.998e+02 5.657e+02 7.781e+02 1.109e+03 2.213e+03, threshold=1.556e+03, percent-clipped=10.0 2023-06-27 05:22:59,492 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1734732.0, ans=0.2 2023-06-27 05:23:37,309 INFO [train.py:996] (0/4) Epoch 10, batch 14700, loss[loss=0.1891, simple_loss=0.2365, pruned_loss=0.07086, over 20747.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.2904, pruned_loss=0.06515, over 4252113.76 frames. ], batch size: 608, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:23:38,045 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1734912.0, ans=0.1 2023-06-27 05:24:04,508 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1734972.0, ans=0.125 2023-06-27 05:25:38,807 INFO [train.py:996] (0/4) Epoch 10, batch 14750, loss[loss=0.3137, simple_loss=0.3785, pruned_loss=0.1245, over 21428.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.296, pruned_loss=0.0679, over 4257517.72 frames. 
], batch size: 471, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:25:50,731 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1735212.0, ans=0.0 2023-06-27 05:26:12,806 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1735272.0, ans=0.2 2023-06-27 05:26:22,719 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1735272.0, ans=0.0 2023-06-27 05:26:30,561 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.686e+02 7.000e+02 1.273e+03 1.820e+03 3.687e+03, threshold=2.546e+03, percent-clipped=36.0 2023-06-27 05:26:31,351 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1735332.0, ans=0.04949747468305833 2023-06-27 05:26:44,235 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.28 vs. limit=15.0 2023-06-27 05:27:00,035 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.16 vs. limit=15.0 2023-06-27 05:27:06,414 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1735452.0, ans=0.0 2023-06-27 05:27:29,177 INFO [train.py:996] (0/4) Epoch 10, batch 14800, loss[loss=0.219, simple_loss=0.3073, pruned_loss=0.06536, over 21657.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3083, pruned_loss=0.07379, over 4262142.65 frames. ], batch size: 298, lr: 2.94e-03, grad_scale: 32.0 2023-06-27 05:28:04,112 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1735572.0, ans=0.1 2023-06-27 05:28:09,898 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1735572.0, ans=0.2 2023-06-27 05:28:13,291 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1735572.0, ans=0.0 2023-06-27 05:29:07,093 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1735752.0, ans=0.1 2023-06-27 05:29:08,858 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1735752.0, ans=0.125 2023-06-27 05:29:20,840 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.54 vs. limit=15.0 2023-06-27 05:29:28,521 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1735812.0, ans=0.0 2023-06-27 05:29:28,597 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1735812.0, ans=0.125 2023-06-27 05:29:29,442 INFO [train.py:996] (0/4) Epoch 10, batch 14850, loss[loss=0.2289, simple_loss=0.3041, pruned_loss=0.0769, over 21883.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.3025, pruned_loss=0.07339, over 4262371.84 frames. 
], batch size: 372, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:29:37,116 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1735812.0, ans=0.0 2023-06-27 05:29:56,875 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1735872.0, ans=0.125 2023-06-27 05:30:00,672 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1735872.0, ans=0.125 2023-06-27 05:30:16,838 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.100e+02 5.316e+02 7.277e+02 1.299e+03 3.940e+03, threshold=1.455e+03, percent-clipped=5.0 2023-06-27 05:31:16,537 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1736052.0, ans=0.0 2023-06-27 05:31:19,429 INFO [train.py:996] (0/4) Epoch 10, batch 14900, loss[loss=0.2351, simple_loss=0.2987, pruned_loss=0.08579, over 20038.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.304, pruned_loss=0.07425, over 4254668.82 frames. ], batch size: 703, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:31:37,200 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.22 vs. limit=15.0 2023-06-27 05:32:49,868 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1736292.0, ans=0.1 2023-06-27 05:32:59,802 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.99 vs. limit=12.0 2023-06-27 05:33:11,111 INFO [train.py:996] (0/4) Epoch 10, batch 14950, loss[loss=0.2276, simple_loss=0.314, pruned_loss=0.07065, over 21798.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.3059, pruned_loss=0.07416, over 4263221.81 frames. ], batch size: 282, lr: 2.94e-03, grad_scale: 8.0 2023-06-27 05:33:31,582 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.29 vs. limit=15.0 2023-06-27 05:33:34,715 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1736472.0, ans=0.125 2023-06-27 05:33:39,846 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1736472.0, ans=0.1 2023-06-27 05:33:52,329 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1736472.0, ans=0.0 2023-06-27 05:34:05,493 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.924e+02 5.785e+02 8.505e+02 1.255e+03 2.502e+03, threshold=1.701e+03, percent-clipped=18.0 2023-06-27 05:34:24,626 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.37 vs. limit=12.0 2023-06-27 05:34:39,661 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1736592.0, ans=0.0 2023-06-27 05:34:57,030 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1736652.0, ans=0.1 2023-06-27 05:35:00,005 INFO [train.py:996] (0/4) Epoch 10, batch 15000, loss[loss=0.2151, simple_loss=0.2859, pruned_loss=0.07214, over 21665.00 frames. 
], tot_loss[loss=0.2287, simple_loss=0.3072, pruned_loss=0.07507, over 4272326.97 frames. ], batch size: 230, lr: 2.94e-03, grad_scale: 8.0 2023-06-27 05:35:00,007 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-27 05:35:10,693 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.2.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.8889, 4.2550, 4.0630, 4.3194], device='cuda:0') 2023-06-27 05:35:19,882 INFO [train.py:1028] (0/4) Epoch 10, validation: loss=0.2554, simple_loss=0.3462, pruned_loss=0.08227, over 1796401.00 frames. 2023-06-27 05:35:19,884 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23714MB 2023-06-27 05:35:46,030 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1736772.0, ans=0.125 2023-06-27 05:37:04,894 INFO [train.py:996] (0/4) Epoch 10, batch 15050, loss[loss=0.2268, simple_loss=0.3228, pruned_loss=0.06543, over 19936.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3073, pruned_loss=0.07537, over 4268773.06 frames. ], batch size: 702, lr: 2.94e-03, grad_scale: 8.0 2023-06-27 05:37:08,992 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1737012.0, ans=0.125 2023-06-27 05:37:20,112 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.18 vs. limit=15.0 2023-06-27 05:37:21,255 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1737012.0, ans=0.125 2023-06-27 05:37:23,530 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.58 vs. limit=12.0 2023-06-27 05:38:05,490 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.468e+02 6.013e+02 1.020e+03 1.764e+03 3.653e+03, threshold=2.041e+03, percent-clipped=28.0 2023-06-27 05:38:15,206 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.86 vs. limit=15.0 2023-06-27 05:38:20,518 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.61 vs. limit=12.0 2023-06-27 05:38:59,253 INFO [train.py:996] (0/4) Epoch 10, batch 15100, loss[loss=0.2367, simple_loss=0.3121, pruned_loss=0.08066, over 21588.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3109, pruned_loss=0.07581, over 4269746.68 frames. ], batch size: 263, lr: 2.94e-03, grad_scale: 8.0 2023-06-27 05:39:37,055 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1737372.0, ans=0.125 2023-06-27 05:39:47,405 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1737432.0, ans=0.0 2023-06-27 05:39:56,184 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1737432.0, ans=0.125 2023-06-27 05:40:24,285 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1737552.0, ans=0.2 2023-06-27 05:40:48,203 INFO [train.py:996] (0/4) Epoch 10, batch 15150, loss[loss=0.1995, simple_loss=0.2724, pruned_loss=0.06334, over 21421.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3066, pruned_loss=0.07494, over 4269565.43 frames. 
], batch size: 131, lr: 2.94e-03, grad_scale: 8.0 2023-06-27 05:41:42,598 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.241e+02 5.996e+02 8.329e+02 1.455e+03 4.229e+03, threshold=1.666e+03, percent-clipped=17.0 2023-06-27 05:41:50,383 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=1737792.0, ans=0.02 2023-06-27 05:42:28,804 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1737852.0, ans=0.0 2023-06-27 05:42:36,458 INFO [train.py:996] (0/4) Epoch 10, batch 15200, loss[loss=0.2407, simple_loss=0.3487, pruned_loss=0.06635, over 19779.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2986, pruned_loss=0.07134, over 4260374.76 frames. ], batch size: 703, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:43:24,032 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.80 vs. limit=15.0 2023-06-27 05:43:28,886 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1738032.0, ans=0.125 2023-06-27 05:43:40,747 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1738092.0, ans=0.125 2023-06-27 05:44:16,490 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_na.min_abs, batch_count=1738152.0, ans=0.02 2023-06-27 05:44:22,673 INFO [train.py:996] (0/4) Epoch 10, batch 15250, loss[loss=0.2547, simple_loss=0.3044, pruned_loss=0.1025, over 21389.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2929, pruned_loss=0.06989, over 4262825.95 frames. ], batch size: 509, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:45:16,769 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.911e+02 6.076e+02 9.164e+02 1.527e+03 3.060e+03, threshold=1.833e+03, percent-clipped=16.0 2023-06-27 05:45:28,035 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1738392.0, ans=10.0 2023-06-27 05:45:43,799 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1738392.0, ans=0.0 2023-06-27 05:46:11,068 INFO [train.py:996] (0/4) Epoch 10, batch 15300, loss[loss=0.2701, simple_loss=0.326, pruned_loss=0.1071, over 21410.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.2962, pruned_loss=0.07259, over 4264206.22 frames. ], batch size: 471, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 05:46:27,589 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1738512.0, ans=0.125 2023-06-27 05:47:38,715 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1738692.0, ans=0.2 2023-06-27 05:47:57,782 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1738812.0, ans=0.2 2023-06-27 05:47:58,657 INFO [train.py:996] (0/4) Epoch 10, batch 15350, loss[loss=0.2367, simple_loss=0.3303, pruned_loss=0.07156, over 21631.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.3004, pruned_loss=0.07426, over 4267251.25 frames. 
], batch size: 389, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 05:48:51,341 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.081e+02 6.656e+02 9.808e+02 1.431e+03 3.197e+03, threshold=1.962e+03, percent-clipped=8.0 2023-06-27 05:48:55,445 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1738932.0, ans=0.0 2023-06-27 05:49:29,870 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1739052.0, ans=0.1 2023-06-27 05:49:31,519 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1739052.0, ans=0.125 2023-06-27 05:49:34,848 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1739052.0, ans=0.1 2023-06-27 05:49:38,925 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.46 vs. limit=15.0 2023-06-27 05:49:40,633 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.70 vs. limit=15.0 2023-06-27 05:49:45,879 INFO [train.py:996] (0/4) Epoch 10, batch 15400, loss[loss=0.2016, simple_loss=0.286, pruned_loss=0.05857, over 21196.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.3016, pruned_loss=0.07233, over 4263661.31 frames. ], batch size: 143, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 05:50:36,626 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1739232.0, ans=0.125 2023-06-27 05:51:13,888 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1739292.0, ans=0.07 2023-06-27 05:51:21,477 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=5.95 vs. limit=12.0 2023-06-27 05:51:30,887 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=1739352.0, ans=0.5 2023-06-27 05:51:33,737 INFO [train.py:996] (0/4) Epoch 10, batch 15450, loss[loss=0.2055, simple_loss=0.3059, pruned_loss=0.05259, over 21823.00 frames. ], tot_loss[loss=0.222, simple_loss=0.3002, pruned_loss=0.07192, over 4262993.06 frames. ], batch size: 351, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 05:51:34,274 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1739412.0, ans=0.05 2023-06-27 05:51:49,498 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.75 vs. 
limit=10.0 2023-06-27 05:52:12,612 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1739472.0, ans=0.125 2023-06-27 05:52:28,022 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.164e+02 6.328e+02 9.249e+02 1.410e+03 2.980e+03, threshold=1.850e+03, percent-clipped=8.0 2023-06-27 05:52:32,018 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1739532.0, ans=0.2 2023-06-27 05:52:58,066 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.32 vs. limit=15.0 2023-06-27 05:52:59,535 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1739592.0, ans=0.125 2023-06-27 05:53:29,064 INFO [train.py:996] (0/4) Epoch 10, batch 15500, loss[loss=0.2286, simple_loss=0.3058, pruned_loss=0.07572, over 21838.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.3017, pruned_loss=0.07174, over 4241962.31 frames. ], batch size: 247, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 05:54:01,690 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1739772.0, ans=0.125 2023-06-27 05:54:48,562 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.80 vs. limit=10.0 2023-06-27 05:55:23,946 INFO [train.py:996] (0/4) Epoch 10, batch 15550, loss[loss=0.1731, simple_loss=0.2418, pruned_loss=0.05218, over 21841.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.3, pruned_loss=0.07054, over 4250471.78 frames. ], batch size: 98, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 05:56:09,345 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1740132.0, ans=0.2 2023-06-27 05:56:17,356 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.325e+02 6.960e+02 1.269e+03 1.845e+03 3.300e+03, threshold=2.538e+03, percent-clipped=23.0 2023-06-27 05:56:21,830 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1740132.0, ans=0.0 2023-06-27 05:56:30,465 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1740192.0, ans=0.07 2023-06-27 05:56:30,487 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1740192.0, ans=0.0 2023-06-27 05:56:39,984 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.58 vs. limit=22.5 2023-06-27 05:57:11,163 INFO [train.py:996] (0/4) Epoch 10, batch 15600, loss[loss=0.2051, simple_loss=0.2767, pruned_loss=0.06674, over 21404.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2933, pruned_loss=0.06864, over 4256190.79 frames. 
], batch size: 389, lr: 2.93e-03, grad_scale: 32.0 2023-06-27 05:57:15,066 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1740312.0, ans=0.125 2023-06-27 05:57:17,223 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1740312.0, ans=0.0 2023-06-27 05:57:27,539 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1740372.0, ans=0.125 2023-06-27 05:58:22,311 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.66 vs. limit=10.0 2023-06-27 05:58:59,240 INFO [train.py:996] (0/4) Epoch 10, batch 15650, loss[loss=0.213, simple_loss=0.2841, pruned_loss=0.07092, over 20647.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2911, pruned_loss=0.06799, over 4244259.48 frames. ], batch size: 607, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 05:59:49,306 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.999e+02 5.253e+02 8.016e+02 1.068e+03 2.204e+03, threshold=1.603e+03, percent-clipped=0.0 2023-06-27 06:00:04,255 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.10 vs. limit=15.0 2023-06-27 06:00:41,567 INFO [train.py:996] (0/4) Epoch 10, batch 15700, loss[loss=0.2084, simple_loss=0.2715, pruned_loss=0.07269, over 21205.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2872, pruned_loss=0.06695, over 4248785.00 frames. ], batch size: 159, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:01:24,207 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1740972.0, ans=0.1 2023-06-27 06:01:56,190 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.27 vs. limit=10.0 2023-06-27 06:02:05,641 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1741152.0, ans=0.125 2023-06-27 06:02:28,426 INFO [train.py:996] (0/4) Epoch 10, batch 15750, loss[loss=0.2014, simple_loss=0.2751, pruned_loss=0.06382, over 21738.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2843, pruned_loss=0.0666, over 4254624.01 frames. ], batch size: 351, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:03:22,680 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.061e+02 5.702e+02 8.249e+02 1.125e+03 2.008e+03, threshold=1.650e+03, percent-clipped=7.0 2023-06-27 06:03:46,649 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.94 vs. limit=22.5 2023-06-27 06:04:14,232 INFO [train.py:996] (0/4) Epoch 10, batch 15800, loss[loss=0.1893, simple_loss=0.2552, pruned_loss=0.06169, over 21319.00 frames. ], tot_loss[loss=0.2057, simple_loss=0.2792, pruned_loss=0.06611, over 4259039.65 frames. 
], batch size: 144, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:04:35,425 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1741572.0, ans=0.125 2023-06-27 06:04:52,649 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1741632.0, ans=0.1 2023-06-27 06:04:52,755 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1741632.0, ans=0.125 2023-06-27 06:06:00,836 INFO [train.py:996] (0/4) Epoch 10, batch 15850, loss[loss=0.1808, simple_loss=0.2493, pruned_loss=0.05616, over 21570.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.2812, pruned_loss=0.06767, over 4260529.23 frames. ], batch size: 230, lr: 2.93e-03, grad_scale: 8.0 2023-06-27 06:06:25,825 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.90 vs. limit=15.0 2023-06-27 06:06:39,050 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1741932.0, ans=0.125 2023-06-27 06:06:57,733 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.205e+02 6.570e+02 8.492e+02 1.187e+03 2.613e+03, threshold=1.698e+03, percent-clipped=9.0 2023-06-27 06:07:28,052 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1742052.0, ans=0.0 2023-06-27 06:07:42,862 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1742052.0, ans=0.035 2023-06-27 06:07:47,408 INFO [train.py:996] (0/4) Epoch 10, batch 15900, loss[loss=0.2219, simple_loss=0.2936, pruned_loss=0.07504, over 21249.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2791, pruned_loss=0.06765, over 4258866.99 frames. ], batch size: 143, lr: 2.93e-03, grad_scale: 8.0 2023-06-27 06:08:09,772 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1742172.0, ans=0.0 2023-06-27 06:08:38,531 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1742232.0, ans=0.95 2023-06-27 06:08:53,856 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1742292.0, ans=0.0 2023-06-27 06:09:33,390 INFO [train.py:996] (0/4) Epoch 10, batch 15950, loss[loss=0.167, simple_loss=0.253, pruned_loss=0.04051, over 21467.00 frames. ], tot_loss[loss=0.2063, simple_loss=0.2815, pruned_loss=0.06554, over 4249725.63 frames. ], batch size: 131, lr: 2.93e-03, grad_scale: 8.0 2023-06-27 06:10:31,667 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.565e+02 5.245e+02 8.616e+02 1.211e+03 4.191e+03, threshold=1.723e+03, percent-clipped=6.0 2023-06-27 06:11:01,334 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1742652.0, ans=0.0 2023-06-27 06:11:03,086 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1742652.0, ans=0.2 2023-06-27 06:11:21,903 INFO [train.py:996] (0/4) Epoch 10, batch 16000, loss[loss=0.2864, simple_loss=0.3602, pruned_loss=0.1063, over 21546.00 frames. ], tot_loss[loss=0.2061, simple_loss=0.2838, pruned_loss=0.0642, over 4256222.77 frames. 
], batch size: 508, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:11:39,807 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1742772.0, ans=0.2 2023-06-27 06:12:17,874 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1742832.0, ans=0.0 2023-06-27 06:12:43,368 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1742952.0, ans=0.1 2023-06-27 06:13:00,864 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1742952.0, ans=0.0 2023-06-27 06:13:10,603 INFO [train.py:996] (0/4) Epoch 10, batch 16050, loss[loss=0.1838, simple_loss=0.2674, pruned_loss=0.05012, over 21466.00 frames. ], tot_loss[loss=0.2077, simple_loss=0.2877, pruned_loss=0.06382, over 4261693.84 frames. ], batch size: 211, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:13:14,624 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1743012.0, ans=0.0 2023-06-27 06:13:16,308 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1743012.0, ans=0.1 2023-06-27 06:13:38,544 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1743072.0, ans=0.025 2023-06-27 06:14:07,170 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.055e+02 6.829e+02 9.641e+02 1.432e+03 3.603e+03, threshold=1.928e+03, percent-clipped=16.0 2023-06-27 06:14:15,951 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1743192.0, ans=0.125 2023-06-27 06:14:24,668 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1743192.0, ans=0.125 2023-06-27 06:14:34,972 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1743252.0, ans=0.125 2023-06-27 06:14:50,599 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1743312.0, ans=0.05 2023-06-27 06:14:51,658 INFO [train.py:996] (0/4) Epoch 10, batch 16100, loss[loss=0.2313, simple_loss=0.3007, pruned_loss=0.08101, over 21319.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2907, pruned_loss=0.06444, over 4271867.71 frames. ], batch size: 143, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:14:52,379 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1743312.0, ans=0.125 2023-06-27 06:15:04,676 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.56 vs. limit=22.5 2023-06-27 06:15:19,115 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1743372.0, ans=0.125 2023-06-27 06:15:22,628 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1743372.0, ans=0.125 2023-06-27 06:16:27,675 INFO [train.py:996] (0/4) Epoch 10, batch 16150, loss[loss=0.2062, simple_loss=0.2866, pruned_loss=0.06293, over 21687.00 frames. 
], tot_loss[loss=0.2107, simple_loss=0.289, pruned_loss=0.06623, over 4285934.82 frames. ], batch size: 230, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:16:49,952 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1743612.0, ans=0.1 2023-06-27 06:17:06,974 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1743672.0, ans=0.1 2023-06-27 06:17:36,691 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.965e+02 5.965e+02 7.575e+02 1.164e+03 3.405e+03, threshold=1.515e+03, percent-clipped=4.0 2023-06-27 06:17:44,153 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1743792.0, ans=0.125 2023-06-27 06:18:01,569 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1743852.0, ans=0.2 2023-06-27 06:18:22,949 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1743852.0, ans=0.125 2023-06-27 06:18:27,523 INFO [train.py:996] (0/4) Epoch 10, batch 16200, loss[loss=0.2594, simple_loss=0.3372, pruned_loss=0.09083, over 21461.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2929, pruned_loss=0.06698, over 4277818.84 frames. ], batch size: 131, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:18:45,529 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1743972.0, ans=0.1 2023-06-27 06:18:45,533 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1743972.0, ans=0.0 2023-06-27 06:19:40,977 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.71 vs. limit=22.5 2023-06-27 06:19:43,692 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1744092.0, ans=0.125 2023-06-27 06:20:13,820 INFO [train.py:996] (0/4) Epoch 10, batch 16250, loss[loss=0.1581, simple_loss=0.2638, pruned_loss=0.02626, over 19760.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.293, pruned_loss=0.06705, over 4272349.09 frames. ], batch size: 703, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:20:56,148 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1744272.0, ans=0.0 2023-06-27 06:21:10,958 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.705e+02 5.225e+02 6.820e+02 1.048e+03 2.777e+03, threshold=1.364e+03, percent-clipped=10.0 2023-06-27 06:21:14,810 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1744392.0, ans=0.125 2023-06-27 06:21:56,592 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.85 vs. limit=12.0 2023-06-27 06:22:00,278 INFO [train.py:996] (0/4) Epoch 10, batch 16300, loss[loss=0.1709, simple_loss=0.264, pruned_loss=0.03889, over 21844.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2878, pruned_loss=0.06332, over 4275684.25 frames. 
], batch size: 317, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:23:33,960 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1744752.0, ans=0.0 2023-06-27 06:23:48,233 INFO [train.py:996] (0/4) Epoch 10, batch 16350, loss[loss=0.2209, simple_loss=0.2954, pruned_loss=0.07315, over 21707.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2877, pruned_loss=0.06395, over 4268493.87 frames. ], batch size: 298, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:23:54,276 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1744812.0, ans=0.2 2023-06-27 06:23:55,839 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1744812.0, ans=0.125 2023-06-27 06:24:45,630 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.899e+02 6.082e+02 8.252e+02 1.130e+03 2.497e+03, threshold=1.650e+03, percent-clipped=10.0 2023-06-27 06:24:47,981 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1744932.0, ans=0.1 2023-06-27 06:25:03,888 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1744992.0, ans=0.04949747468305833 2023-06-27 06:25:11,125 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1744992.0, ans=0.0 2023-06-27 06:25:35,471 INFO [train.py:996] (0/4) Epoch 10, batch 16400, loss[loss=0.2239, simple_loss=0.2937, pruned_loss=0.07701, over 21320.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2926, pruned_loss=0.06579, over 4268679.53 frames. ], batch size: 176, lr: 2.93e-03, grad_scale: 32.0 2023-06-27 06:25:47,369 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.67 vs. limit=15.0 2023-06-27 06:26:02,411 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1745172.0, ans=0.0 2023-06-27 06:26:08,708 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1745172.0, ans=0.1 2023-06-27 06:27:22,341 INFO [train.py:996] (0/4) Epoch 10, batch 16450, loss[loss=0.188, simple_loss=0.263, pruned_loss=0.05645, over 21524.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2937, pruned_loss=0.06646, over 4273271.19 frames. ], batch size: 211, lr: 2.93e-03, grad_scale: 32.0 2023-06-27 06:28:19,537 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.272e+02 6.597e+02 9.235e+02 1.601e+03 3.322e+03, threshold=1.847e+03, percent-clipped=22.0 2023-06-27 06:28:32,498 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1745592.0, ans=0.2 2023-06-27 06:28:35,992 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1745592.0, ans=0.04949747468305833 2023-06-27 06:29:15,240 INFO [train.py:996] (0/4) Epoch 10, batch 16500, loss[loss=0.1765, simple_loss=0.2447, pruned_loss=0.05411, over 21371.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2919, pruned_loss=0.06689, over 4278963.49 frames. 
], batch size: 194, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:29:37,272 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1745772.0, ans=0.125 2023-06-27 06:29:45,083 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.76 vs. limit=12.0 2023-06-27 06:30:30,733 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=1745892.0, ans=15.0 2023-06-27 06:31:10,034 INFO [train.py:996] (0/4) Epoch 10, batch 16550, loss[loss=0.2031, simple_loss=0.2739, pruned_loss=0.06619, over 20099.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2917, pruned_loss=0.06588, over 4274673.06 frames. ], batch size: 702, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:32:11,787 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.278e+02 6.354e+02 1.023e+03 1.715e+03 3.969e+03, threshold=2.045e+03, percent-clipped=20.0 2023-06-27 06:32:34,802 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.47 vs. limit=15.0 2023-06-27 06:32:43,262 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1746252.0, ans=0.2 2023-06-27 06:33:00,857 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1746312.0, ans=0.0 2023-06-27 06:33:01,733 INFO [train.py:996] (0/4) Epoch 10, batch 16600, loss[loss=0.2728, simple_loss=0.3764, pruned_loss=0.08454, over 21714.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2971, pruned_loss=0.06784, over 4272858.91 frames. ], batch size: 351, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:33:02,801 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1746312.0, ans=0.125 2023-06-27 06:33:09,900 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1746312.0, ans=0.0 2023-06-27 06:34:11,845 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1746492.0, ans=0.04949747468305833 2023-06-27 06:34:32,042 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.00 vs. limit=6.0 2023-06-27 06:34:44,385 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1746552.0, ans=0.125 2023-06-27 06:34:50,954 INFO [train.py:996] (0/4) Epoch 10, batch 16650, loss[loss=0.2464, simple_loss=0.3331, pruned_loss=0.07984, over 21835.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.3063, pruned_loss=0.07076, over 4276151.80 frames. 
], batch size: 118, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:35:35,428 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1746672.0, ans=0.125 2023-06-27 06:35:58,143 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.748e+02 7.097e+02 9.518e+02 1.581e+03 3.619e+03, threshold=1.904e+03, percent-clipped=14.0 2023-06-27 06:36:42,267 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1746852.0, ans=10.0 2023-06-27 06:36:48,642 INFO [train.py:996] (0/4) Epoch 10, batch 16700, loss[loss=0.1858, simple_loss=0.2459, pruned_loss=0.0628, over 21456.00 frames. ], tot_loss[loss=0.226, simple_loss=0.3079, pruned_loss=0.07206, over 4263308.04 frames. ], batch size: 194, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:36:49,279 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1746912.0, ans=0.1 2023-06-27 06:37:15,513 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1746972.0, ans=0.125 2023-06-27 06:37:50,018 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1747032.0, ans=0.125 2023-06-27 06:38:46,421 INFO [train.py:996] (0/4) Epoch 10, batch 16750, loss[loss=0.2788, simple_loss=0.3721, pruned_loss=0.09271, over 21701.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3108, pruned_loss=0.07447, over 4268127.58 frames. ], batch size: 441, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:38:49,715 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.26 vs. limit=22.5 2023-06-27 06:39:09,792 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1747272.0, ans=0.1 2023-06-27 06:39:18,835 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1747272.0, ans=0.125 2023-06-27 06:39:48,732 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1747332.0, ans=0.09899494936611666 2023-06-27 06:39:53,244 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.731e+02 7.125e+02 1.124e+03 1.580e+03 3.763e+03, threshold=2.248e+03, percent-clipped=17.0 2023-06-27 06:40:40,776 INFO [train.py:996] (0/4) Epoch 10, batch 16800, loss[loss=0.2173, simple_loss=0.2864, pruned_loss=0.07408, over 21848.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3116, pruned_loss=0.07347, over 4267012.65 frames. ], batch size: 118, lr: 2.93e-03, grad_scale: 32.0 2023-06-27 06:41:07,093 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1747572.0, ans=0.0 2023-06-27 06:41:07,127 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1747572.0, ans=0.0 2023-06-27 06:41:44,847 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1747692.0, ans=0.125 2023-06-27 06:41:59,163 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.82 vs. 
limit=15.0 2023-06-27 06:42:22,102 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1747752.0, ans=0.125 2023-06-27 06:42:26,708 INFO [train.py:996] (0/4) Epoch 10, batch 16850, loss[loss=0.2101, simple_loss=0.2876, pruned_loss=0.06634, over 21509.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3074, pruned_loss=0.07365, over 4273171.39 frames. ], batch size: 131, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:43:16,322 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1747932.0, ans=0.125 2023-06-27 06:43:27,407 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.310e+02 6.690e+02 9.145e+02 1.519e+03 3.869e+03, threshold=1.829e+03, percent-clipped=12.0 2023-06-27 06:43:40,079 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1747992.0, ans=0.125 2023-06-27 06:44:11,627 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1748112.0, ans=0.0 2023-06-27 06:44:12,641 INFO [train.py:996] (0/4) Epoch 10, batch 16900, loss[loss=0.2419, simple_loss=0.3552, pruned_loss=0.0643, over 20732.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.3024, pruned_loss=0.07207, over 4272962.93 frames. ], batch size: 607, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:44:30,236 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=1748172.0, ans=0.05 2023-06-27 06:45:02,886 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1748232.0, ans=0.125 2023-06-27 06:45:59,640 INFO [train.py:996] (0/4) Epoch 10, batch 16950, loss[loss=0.2324, simple_loss=0.2898, pruned_loss=0.08754, over 21761.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2955, pruned_loss=0.07037, over 4279224.15 frames. ], batch size: 508, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:46:02,005 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1748412.0, ans=0.125 2023-06-27 06:46:24,337 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1748472.0, ans=0.125 2023-06-27 06:47:00,163 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.671e+02 6.123e+02 1.009e+03 1.392e+03 3.065e+03, threshold=2.019e+03, percent-clipped=11.0 2023-06-27 06:47:40,407 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1748652.0, ans=0.125 2023-06-27 06:47:47,011 INFO [train.py:996] (0/4) Epoch 10, batch 17000, loss[loss=0.2193, simple_loss=0.2777, pruned_loss=0.08043, over 21634.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2936, pruned_loss=0.07026, over 4283060.40 frames. 
], batch size: 548, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:48:15,609 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1748772.0, ans=0.2 2023-06-27 06:48:23,212 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 06:48:26,733 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1748772.0, ans=0.125 2023-06-27 06:49:16,731 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1748952.0, ans=0.95 2023-06-27 06:49:35,366 INFO [train.py:996] (0/4) Epoch 10, batch 17050, loss[loss=0.273, simple_loss=0.345, pruned_loss=0.1005, over 21581.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.3015, pruned_loss=0.07197, over 4290431.68 frames. ], batch size: 471, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:49:41,143 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1749012.0, ans=0.1 2023-06-27 06:49:46,568 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1749012.0, ans=0.1 2023-06-27 06:50:34,069 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1749132.0, ans=0.125 2023-06-27 06:50:39,793 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.381e+02 7.817e+02 1.217e+03 1.816e+03 4.089e+03, threshold=2.434e+03, percent-clipped=19.0 2023-06-27 06:50:42,136 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1749192.0, ans=0.1 2023-06-27 06:50:47,037 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1749192.0, ans=0.2 2023-06-27 06:51:11,153 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1749252.0, ans=0.2 2023-06-27 06:51:20,994 INFO [train.py:996] (0/4) Epoch 10, batch 17100, loss[loss=0.2017, simple_loss=0.2753, pruned_loss=0.06408, over 21437.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.3005, pruned_loss=0.07264, over 4295082.47 frames. ], batch size: 211, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:51:35,551 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1749312.0, ans=0.0 2023-06-27 06:51:49,977 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.38 vs. 
limit=6.0 2023-06-27 06:52:12,171 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1749432.0, ans=0.0 2023-06-27 06:52:12,265 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1749432.0, ans=0.125 2023-06-27 06:52:23,796 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1749432.0, ans=0.125 2023-06-27 06:52:32,728 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1749492.0, ans=0.0 2023-06-27 06:52:48,014 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1749492.0, ans=0.125 2023-06-27 06:52:49,916 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 06:53:07,958 INFO [train.py:996] (0/4) Epoch 10, batch 17150, loss[loss=0.1772, simple_loss=0.2558, pruned_loss=0.04928, over 21257.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2961, pruned_loss=0.07149, over 4293828.22 frames. ], batch size: 176, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:53:15,267 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1749612.0, ans=0.2 2023-06-27 06:54:10,034 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1749732.0, ans=0.1 2023-06-27 06:54:16,435 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.998e+02 6.290e+02 9.886e+02 1.236e+03 2.278e+03, threshold=1.977e+03, percent-clipped=0.0 2023-06-27 06:54:29,372 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1749792.0, ans=0.0 2023-06-27 06:55:01,794 INFO [train.py:996] (0/4) Epoch 10, batch 17200, loss[loss=0.2411, simple_loss=0.3139, pruned_loss=0.08409, over 21356.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2959, pruned_loss=0.07174, over 4291762.85 frames. ], batch size: 159, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:55:06,337 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1749912.0, ans=0.1 2023-06-27 06:55:48,953 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1750032.0, ans=0.0 2023-06-27 06:56:56,961 INFO [train.py:996] (0/4) Epoch 10, batch 17250, loss[loss=0.2, simple_loss=0.289, pruned_loss=0.05551, over 21850.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.2998, pruned_loss=0.07439, over 4290424.73 frames. ], batch size: 282, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:57:00,038 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.66 vs. limit=22.5 2023-06-27 06:57:05,102 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.38 vs. 
limit=15.0 2023-06-27 06:58:00,060 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.260e+02 7.026e+02 1.059e+03 1.492e+03 2.502e+03, threshold=2.118e+03, percent-clipped=5.0 2023-06-27 06:58:14,395 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1750392.0, ans=0.1 2023-06-27 06:58:50,681 INFO [train.py:996] (0/4) Epoch 10, batch 17300, loss[loss=0.2469, simple_loss=0.3505, pruned_loss=0.07163, over 20928.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3069, pruned_loss=0.0769, over 4286781.50 frames. ], batch size: 607, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:00:17,866 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1750752.0, ans=0.0 2023-06-27 07:00:31,883 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1750752.0, ans=0.2 2023-06-27 07:00:35,528 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1750752.0, ans=0.0 2023-06-27 07:00:39,983 INFO [train.py:996] (0/4) Epoch 10, batch 17350, loss[loss=0.1932, simple_loss=0.27, pruned_loss=0.05821, over 21274.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.308, pruned_loss=0.07642, over 4283450.98 frames. ], batch size: 159, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:01:16,176 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1750872.0, ans=0.1 2023-06-27 07:01:22,113 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.07 vs. limit=10.0 2023-06-27 07:01:43,500 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.572e+02 6.288e+02 8.971e+02 1.269e+03 2.386e+03, threshold=1.794e+03, percent-clipped=3.0 2023-06-27 07:02:13,231 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1751052.0, ans=0.0 2023-06-27 07:02:35,912 INFO [train.py:996] (0/4) Epoch 10, batch 17400, loss[loss=0.2536, simple_loss=0.3355, pruned_loss=0.08582, over 21606.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.3051, pruned_loss=0.07315, over 4272794.23 frames. ], batch size: 441, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:02:56,424 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1751172.0, ans=0.125 2023-06-27 07:03:01,345 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1751172.0, ans=0.125 2023-06-27 07:03:30,197 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.21 vs. 
limit=15.0 2023-06-27 07:03:31,318 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1751232.0, ans=0.125 2023-06-27 07:03:31,380 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1751232.0, ans=0.0 2023-06-27 07:03:44,949 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1751292.0, ans=0.0 2023-06-27 07:04:23,545 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1751412.0, ans=0.125 2023-06-27 07:04:24,544 INFO [train.py:996] (0/4) Epoch 10, batch 17450, loss[loss=0.192, simple_loss=0.2859, pruned_loss=0.04909, over 21687.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.3019, pruned_loss=0.0717, over 4266871.14 frames. ], batch size: 298, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:04:35,504 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1751412.0, ans=0.1 2023-06-27 07:04:42,786 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.14 vs. limit=12.0 2023-06-27 07:05:31,265 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.94 vs. limit=22.5 2023-06-27 07:05:31,486 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.811e+02 5.744e+02 7.670e+02 1.157e+03 3.080e+03, threshold=1.534e+03, percent-clipped=10.0 2023-06-27 07:06:02,032 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1751652.0, ans=0.0 2023-06-27 07:06:07,314 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1751652.0, ans=0.0 2023-06-27 07:06:11,705 INFO [train.py:996] (0/4) Epoch 10, batch 17500, loss[loss=0.2504, simple_loss=0.3052, pruned_loss=0.09782, over 21698.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2977, pruned_loss=0.07003, over 4277327.25 frames. ], batch size: 473, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:06:43,433 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1751772.0, ans=0.125 2023-06-27 07:07:02,183 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1751832.0, ans=0.0 2023-06-27 07:07:16,301 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1751832.0, ans=0.0 2023-06-27 07:07:22,741 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1751892.0, ans=0.125 2023-06-27 07:07:24,699 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.13 vs. limit=15.0 2023-06-27 07:07:27,328 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1751892.0, ans=0.125 2023-06-27 07:07:51,545 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.26 vs. 
limit=15.0 2023-06-27 07:07:52,421 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-292000.pt 2023-06-27 07:07:59,011 INFO [train.py:996] (0/4) Epoch 10, batch 17550, loss[loss=0.2181, simple_loss=0.3071, pruned_loss=0.06454, over 21867.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2992, pruned_loss=0.06942, over 4277675.72 frames. ], batch size: 107, lr: 2.92e-03, grad_scale: 8.0 2023-06-27 07:08:44,868 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1752132.0, ans=0.125 2023-06-27 07:08:49,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1752132.0, ans=0.125 2023-06-27 07:09:08,422 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.979e+02 5.513e+02 7.220e+02 1.144e+03 2.854e+03, threshold=1.444e+03, percent-clipped=10.0 2023-06-27 07:09:48,126 INFO [train.py:996] (0/4) Epoch 10, batch 17600, loss[loss=0.2401, simple_loss=0.329, pruned_loss=0.07559, over 21742.00 frames. ], tot_loss[loss=0.22, simple_loss=0.3009, pruned_loss=0.06955, over 4273352.41 frames. ], batch size: 124, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:10:19,080 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1752372.0, ans=0.125 2023-06-27 07:10:32,953 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1752432.0, ans=0.125 2023-06-27 07:11:16,124 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1752492.0, ans=0.0 2023-06-27 07:11:16,744 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.20 vs. limit=12.0 2023-06-27 07:11:17,874 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1752552.0, ans=0.0 2023-06-27 07:11:22,179 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.76 vs. limit=15.0 2023-06-27 07:11:28,352 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1752552.0, ans=0.1 2023-06-27 07:11:36,232 INFO [train.py:996] (0/4) Epoch 10, batch 17650, loss[loss=0.1916, simple_loss=0.2739, pruned_loss=0.05467, over 21728.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2988, pruned_loss=0.06904, over 4268345.14 frames. ], batch size: 391, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:11:39,225 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.36 vs. 
limit=15.0 2023-06-27 07:11:59,426 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=1752612.0, ans=10.0 2023-06-27 07:12:11,427 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1752672.0, ans=0.125 2023-06-27 07:12:45,227 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1752792.0, ans=0.04949747468305833 2023-06-27 07:12:51,344 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.366e+02 6.949e+02 1.125e+03 1.736e+03 3.581e+03, threshold=2.249e+03, percent-clipped=33.0 2023-06-27 07:12:53,444 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1752792.0, ans=10.0 2023-06-27 07:13:08,276 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.83 vs. limit=15.0 2023-06-27 07:13:11,775 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1752852.0, ans=0.1 2023-06-27 07:13:22,084 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1752852.0, ans=0.125 2023-06-27 07:13:30,134 INFO [train.py:996] (0/4) Epoch 10, batch 17700, loss[loss=0.1975, simple_loss=0.291, pruned_loss=0.05204, over 21410.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2949, pruned_loss=0.06696, over 4270408.77 frames. ], batch size: 194, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:13:52,567 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1752972.0, ans=0.125 2023-06-27 07:13:53,202 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.15 vs. limit=15.0 2023-06-27 07:14:20,239 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1753032.0, ans=0.125 2023-06-27 07:14:36,262 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1753032.0, ans=0.0 2023-06-27 07:14:53,911 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_na.min_abs, batch_count=1753092.0, ans=0.02 2023-06-27 07:15:25,777 INFO [train.py:996] (0/4) Epoch 10, batch 17750, loss[loss=0.2527, simple_loss=0.3343, pruned_loss=0.08562, over 21602.00 frames. ], tot_loss[loss=0.219, simple_loss=0.3003, pruned_loss=0.06888, over 4265590.53 frames. ], batch size: 389, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:15:26,440 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1753212.0, ans=0.035 2023-06-27 07:15:26,544 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1753212.0, ans=0.125 2023-06-27 07:15:49,580 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1753272.0, ans=0.07 2023-06-27 07:16:16,566 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.14 vs. 
limit=15.0 2023-06-27 07:16:31,248 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.520e+02 6.307e+02 8.574e+02 1.258e+03 1.929e+03, threshold=1.715e+03, percent-clipped=0.0 2023-06-27 07:16:35,422 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1753392.0, ans=0.1 2023-06-27 07:17:15,948 INFO [train.py:996] (0/4) Epoch 10, batch 17800, loss[loss=0.2229, simple_loss=0.3074, pruned_loss=0.06921, over 21310.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2993, pruned_loss=0.06799, over 4259531.12 frames. ], batch size: 549, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:18:54,863 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1753752.0, ans=0.125 2023-06-27 07:19:09,868 INFO [train.py:996] (0/4) Epoch 10, batch 17850, loss[loss=0.2441, simple_loss=0.322, pruned_loss=0.08312, over 21677.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.3008, pruned_loss=0.0688, over 4257083.75 frames. ], batch size: 351, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:19:41,713 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1753872.0, ans=0.125 2023-06-27 07:20:02,228 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1753932.0, ans=0.125 2023-06-27 07:20:19,508 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.075e+02 5.767e+02 7.778e+02 1.051e+03 2.491e+03, threshold=1.556e+03, percent-clipped=2.0 2023-06-27 07:20:25,347 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 07:20:59,129 INFO [train.py:996] (0/4) Epoch 10, batch 17900, loss[loss=0.1793, simple_loss=0.2519, pruned_loss=0.05339, over 21859.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.3045, pruned_loss=0.06993, over 4265625.04 frames. ], batch size: 107, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:21:30,545 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.28 vs. limit=22.5 2023-06-27 07:21:56,501 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1754232.0, ans=0.125 2023-06-27 07:22:54,515 INFO [train.py:996] (0/4) Epoch 10, batch 17950, loss[loss=0.1802, simple_loss=0.2566, pruned_loss=0.05193, over 21857.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.3041, pruned_loss=0.06723, over 4263923.70 frames. ], batch size: 98, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:23:51,425 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1754532.0, ans=0.125 2023-06-27 07:23:55,589 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.13 vs. limit=15.0 2023-06-27 07:23:57,655 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.559e+02 6.955e+02 1.067e+03 1.323e+03 3.422e+03, threshold=2.134e+03, percent-clipped=13.0 2023-06-27 07:23:58,885 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.69 vs. 
limit=22.5 2023-06-27 07:24:31,349 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1754652.0, ans=0.125 2023-06-27 07:24:32,002 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.71 vs. limit=15.0 2023-06-27 07:24:41,311 INFO [train.py:996] (0/4) Epoch 10, batch 18000, loss[loss=0.1841, simple_loss=0.2577, pruned_loss=0.05527, over 21610.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2969, pruned_loss=0.06557, over 4261215.74 frames. ], batch size: 248, lr: 2.92e-03, grad_scale: 32.0 2023-06-27 07:24:41,312 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-27 07:24:59,825 INFO [train.py:1028] (0/4) Epoch 10, validation: loss=0.2583, simple_loss=0.3514, pruned_loss=0.08255, over 1796401.00 frames. 2023-06-27 07:24:59,826 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23714MB 2023-06-27 07:25:28,948 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1754772.0, ans=0.125 2023-06-27 07:26:32,740 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1754952.0, ans=0.0 2023-06-27 07:26:45,144 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1754952.0, ans=0.125 2023-06-27 07:26:48,118 INFO [train.py:996] (0/4) Epoch 10, batch 18050, loss[loss=0.2052, simple_loss=0.2731, pruned_loss=0.06861, over 21915.00 frames. ], tot_loss[loss=0.211, simple_loss=0.292, pruned_loss=0.06501, over 4263996.23 frames. ], batch size: 373, lr: 2.92e-03, grad_scale: 32.0 2023-06-27 07:26:52,602 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1755012.0, ans=0.025 2023-06-27 07:27:19,072 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1755072.0, ans=0.2 2023-06-27 07:27:31,501 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1755072.0, ans=0.04949747468305833 2023-06-27 07:27:32,115 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.44 vs. limit=22.5 2023-06-27 07:28:06,132 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.888e+02 5.313e+02 7.169e+02 9.498e+02 2.481e+03, threshold=1.434e+03, percent-clipped=3.0 2023-06-27 07:28:29,058 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1755252.0, ans=0.1 2023-06-27 07:28:37,174 INFO [train.py:996] (0/4) Epoch 10, batch 18100, loss[loss=0.2459, simple_loss=0.3294, pruned_loss=0.08122, over 21606.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2964, pruned_loss=0.06701, over 4260724.48 frames. ], batch size: 414, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:28:55,765 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1755312.0, ans=0.0 2023-06-27 07:29:03,089 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.73 vs. 
limit=15.0 2023-06-27 07:30:24,624 INFO [train.py:996] (0/4) Epoch 10, batch 18150, loss[loss=0.2211, simple_loss=0.2921, pruned_loss=0.07508, over 21571.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2981, pruned_loss=0.06609, over 4261888.62 frames. ], batch size: 391, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:30:48,654 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1755612.0, ans=0.0 2023-06-27 07:31:04,337 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1755672.0, ans=0.5 2023-06-27 07:31:15,065 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1755732.0, ans=0.125 2023-06-27 07:31:20,348 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1755732.0, ans=0.125 2023-06-27 07:31:23,604 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1755732.0, ans=0.0 2023-06-27 07:31:42,083 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.282e+02 6.043e+02 8.888e+02 1.339e+03 2.734e+03, threshold=1.778e+03, percent-clipped=20.0 2023-06-27 07:31:49,525 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1755792.0, ans=0.125 2023-06-27 07:32:11,837 INFO [train.py:996] (0/4) Epoch 10, batch 18200, loss[loss=0.199, simple_loss=0.2709, pruned_loss=0.06353, over 21809.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2935, pruned_loss=0.06614, over 4267881.31 frames. ], batch size: 352, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:32:39,953 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1755972.0, ans=0.125 2023-06-27 07:33:57,128 INFO [train.py:996] (0/4) Epoch 10, batch 18250, loss[loss=0.2079, simple_loss=0.2746, pruned_loss=0.07061, over 21568.00 frames. ], tot_loss[loss=0.2061, simple_loss=0.2848, pruned_loss=0.06367, over 4272316.67 frames. ], batch size: 548, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:34:08,041 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1756212.0, ans=0.125 2023-06-27 07:34:29,155 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1756272.0, ans=0.2 2023-06-27 07:34:43,885 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1756332.0, ans=0.125 2023-06-27 07:35:06,624 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.839e+02 5.360e+02 7.214e+02 1.131e+03 2.943e+03, threshold=1.443e+03, percent-clipped=6.0 2023-06-27 07:35:35,655 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1756452.0, ans=0.1 2023-06-27 07:35:37,301 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1756452.0, ans=0.1 2023-06-27 07:35:41,608 INFO [train.py:996] (0/4) Epoch 10, batch 18300, loss[loss=0.1998, simple_loss=0.2725, pruned_loss=0.06356, over 21171.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.2857, pruned_loss=0.06403, over 4274712.24 frames. 
], batch size: 159, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:35:43,621 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1756512.0, ans=0.125 2023-06-27 07:35:59,128 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1756572.0, ans=0.125 2023-06-27 07:36:33,604 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 07:36:50,957 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.41 vs. limit=15.0 2023-06-27 07:37:27,284 INFO [train.py:996] (0/4) Epoch 10, batch 18350, loss[loss=0.2054, simple_loss=0.2941, pruned_loss=0.05834, over 21700.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2935, pruned_loss=0.06426, over 4263086.15 frames. ], batch size: 247, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:37:42,521 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.97 vs. limit=22.5 2023-06-27 07:38:02,764 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1756872.0, ans=0.125 2023-06-27 07:38:39,351 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.953e+02 5.887e+02 8.763e+02 1.316e+03 3.037e+03, threshold=1.753e+03, percent-clipped=16.0 2023-06-27 07:39:08,588 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1757052.0, ans=0.1 2023-06-27 07:39:10,292 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1757052.0, ans=0.0 2023-06-27 07:39:16,531 INFO [train.py:996] (0/4) Epoch 10, batch 18400, loss[loss=0.1723, simple_loss=0.2524, pruned_loss=0.04606, over 21558.00 frames. ], tot_loss[loss=0.2071, simple_loss=0.2883, pruned_loss=0.06301, over 4263553.24 frames. ], batch size: 263, lr: 2.92e-03, grad_scale: 32.0 2023-06-27 07:39:45,687 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1757172.0, ans=0.2 2023-06-27 07:39:51,755 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.16 vs. limit=6.0 2023-06-27 07:39:58,260 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1757172.0, ans=0.2 2023-06-27 07:40:03,269 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1757232.0, ans=0.0 2023-06-27 07:40:03,726 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.95 vs. limit=15.0 2023-06-27 07:40:34,476 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1757292.0, ans=0.0 2023-06-27 07:40:44,596 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1757292.0, ans=0.0 2023-06-27 07:41:04,184 INFO [train.py:996] (0/4) Epoch 10, batch 18450, loss[loss=0.1568, simple_loss=0.2472, pruned_loss=0.03318, over 21681.00 frames. 
], tot_loss[loss=0.2016, simple_loss=0.2833, pruned_loss=0.0599, over 4272469.80 frames. ], batch size: 298, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:42:06,191 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1757532.0, ans=0.0 2023-06-27 07:42:17,468 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.327e+02 4.825e+02 6.029e+02 8.495e+02 1.994e+03, threshold=1.206e+03, percent-clipped=1.0 2023-06-27 07:42:42,164 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1757652.0, ans=0.125 2023-06-27 07:42:42,199 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1757652.0, ans=0.125 2023-06-27 07:42:50,201 INFO [train.py:996] (0/4) Epoch 10, batch 18500, loss[loss=0.1842, simple_loss=0.2552, pruned_loss=0.05654, over 21793.00 frames. ], tot_loss[loss=0.1987, simple_loss=0.2786, pruned_loss=0.05946, over 4272924.20 frames. ], batch size: 352, lr: 2.92e-03, grad_scale: 8.0 2023-06-27 07:43:56,208 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 07:44:37,148 INFO [train.py:996] (0/4) Epoch 10, batch 18550, loss[loss=0.1744, simple_loss=0.247, pruned_loss=0.0509, over 15921.00 frames. ], tot_loss[loss=0.1971, simple_loss=0.2762, pruned_loss=0.05895, over 4264909.86 frames. ], batch size: 60, lr: 2.92e-03, grad_scale: 8.0 2023-06-27 07:45:20,176 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1758072.0, ans=0.0 2023-06-27 07:45:37,063 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1758132.0, ans=0.125 2023-06-27 07:45:57,523 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.953e+02 6.338e+02 9.700e+02 1.484e+03 3.316e+03, threshold=1.940e+03, percent-clipped=34.0 2023-06-27 07:46:24,946 INFO [train.py:996] (0/4) Epoch 10, batch 18600, loss[loss=0.2066, simple_loss=0.2769, pruned_loss=0.06819, over 21636.00 frames. ], tot_loss[loss=0.1974, simple_loss=0.275, pruned_loss=0.05985, over 4259785.52 frames. ], batch size: 415, lr: 2.92e-03, grad_scale: 8.0 2023-06-27 07:46:28,781 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1758312.0, ans=0.5 2023-06-27 07:47:40,058 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.58 vs. limit=15.0 2023-06-27 07:48:09,078 INFO [train.py:996] (0/4) Epoch 10, batch 18650, loss[loss=0.1985, simple_loss=0.2693, pruned_loss=0.0639, over 21703.00 frames. ], tot_loss[loss=0.1984, simple_loss=0.2756, pruned_loss=0.06067, over 4262670.63 frames. ], batch size: 333, lr: 2.92e-03, grad_scale: 8.0 2023-06-27 07:49:15,890 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.00 vs. limit=15.0 2023-06-27 07:49:21,081 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.715e+02 5.496e+02 8.127e+02 1.461e+03 3.115e+03, threshold=1.625e+03, percent-clipped=10.0 2023-06-27 07:49:37,258 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.24 vs. 
limit=15.0 2023-06-27 07:49:53,355 INFO [train.py:996] (0/4) Epoch 10, batch 18700, loss[loss=0.2075, simple_loss=0.2874, pruned_loss=0.06385, over 21864.00 frames. ], tot_loss[loss=0.1978, simple_loss=0.2737, pruned_loss=0.06096, over 4262138.46 frames. ], batch size: 118, lr: 2.92e-03, grad_scale: 8.0 2023-06-27 07:50:27,006 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1758972.0, ans=0.1 2023-06-27 07:51:03,599 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.99 vs. limit=15.0 2023-06-27 07:51:39,702 INFO [train.py:996] (0/4) Epoch 10, batch 18750, loss[loss=0.2062, simple_loss=0.282, pruned_loss=0.06521, over 21251.00 frames. ], tot_loss[loss=0.1993, simple_loss=0.2744, pruned_loss=0.06215, over 4263530.72 frames. ], batch size: 176, lr: 2.92e-03, grad_scale: 8.0 2023-06-27 07:51:57,755 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1759272.0, ans=0.0 2023-06-27 07:52:18,499 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.04 vs. limit=15.0 2023-06-27 07:52:28,164 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1759332.0, ans=0.125 2023-06-27 07:52:52,808 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.039e+02 6.322e+02 1.036e+03 1.574e+03 2.810e+03, threshold=2.072e+03, percent-clipped=23.0 2023-06-27 07:52:57,492 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.68 vs. limit=15.0 2023-06-27 07:53:25,205 INFO [train.py:996] (0/4) Epoch 10, batch 18800, loss[loss=0.1669, simple_loss=0.2558, pruned_loss=0.03898, over 21757.00 frames. ], tot_loss[loss=0.2033, simple_loss=0.2797, pruned_loss=0.06346, over 4265027.34 frames. ], batch size: 247, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:54:19,001 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.60 vs. limit=15.0 2023-06-27 07:54:55,692 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 07:55:09,026 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1759812.0, ans=0.0 2023-06-27 07:55:10,078 INFO [train.py:996] (0/4) Epoch 10, batch 18850, loss[loss=0.2343, simple_loss=0.293, pruned_loss=0.08773, over 19977.00 frames. ], tot_loss[loss=0.1971, simple_loss=0.2744, pruned_loss=0.05986, over 4254679.80 frames. 
], batch size: 703, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:55:19,278 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1759812.0, ans=0.5 2023-06-27 07:55:38,848 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1759872.0, ans=0.2 2023-06-27 07:55:38,929 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1759872.0, ans=0.125 2023-06-27 07:55:52,827 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.06 vs. limit=15.0 2023-06-27 07:56:23,527 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.478e+02 5.405e+02 6.936e+02 9.507e+02 2.005e+03, threshold=1.387e+03, percent-clipped=0.0 2023-06-27 07:56:42,203 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn1.whiten.whitening_limit, batch_count=1760052.0, ans=22.5 2023-06-27 07:56:55,265 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1760112.0, ans=0.0 2023-06-27 07:56:56,202 INFO [train.py:996] (0/4) Epoch 10, batch 18900, loss[loss=0.2241, simple_loss=0.2872, pruned_loss=0.08053, over 21355.00 frames. ], tot_loss[loss=0.1963, simple_loss=0.272, pruned_loss=0.06034, over 4255737.45 frames. ], batch size: 143, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:57:54,682 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1760232.0, ans=0.1 2023-06-27 07:57:55,228 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.51 vs. limit=15.0 2023-06-27 07:58:06,855 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.88 vs. limit=15.0 2023-06-27 07:58:23,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1760352.0, ans=0.1 2023-06-27 07:58:42,101 INFO [train.py:996] (0/4) Epoch 10, batch 18950, loss[loss=0.2133, simple_loss=0.291, pruned_loss=0.06786, over 21448.00 frames. ], tot_loss[loss=0.1986, simple_loss=0.2738, pruned_loss=0.06174, over 4255440.21 frames. ], batch size: 131, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:59:11,183 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1760472.0, ans=0.125 2023-06-27 07:59:57,333 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.836e+02 7.333e+02 1.084e+03 1.694e+03 3.772e+03, threshold=2.167e+03, percent-clipped=36.0 2023-06-27 07:59:58,172 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1760592.0, ans=0.125 2023-06-27 08:00:13,564 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1760652.0, ans=0.125 2023-06-27 08:00:24,673 INFO [train.py:996] (0/4) Epoch 10, batch 19000, loss[loss=0.2736, simple_loss=0.3464, pruned_loss=0.1004, over 21260.00 frames. ], tot_loss[loss=0.2057, simple_loss=0.2833, pruned_loss=0.0641, over 4229629.22 frames. 
], batch size: 143, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 08:00:44,165 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.82 vs. limit=5.0 2023-06-27 08:00:48,350 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1760772.0, ans=0.125 2023-06-27 08:00:51,964 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1760772.0, ans=0.0 2023-06-27 08:01:34,397 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=5.72 vs. limit=22.5 2023-06-27 08:01:35,604 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1760892.0, ans=0.2 2023-06-27 08:01:37,473 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1760892.0, ans=0.2 2023-06-27 08:01:38,157 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.62 vs. limit=15.0 2023-06-27 08:02:06,192 INFO [train.py:996] (0/4) Epoch 10, batch 19050, loss[loss=0.2365, simple_loss=0.305, pruned_loss=0.08401, over 21898.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2892, pruned_loss=0.06802, over 4249542.89 frames. ], batch size: 371, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 08:03:17,548 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.14 vs. limit=15.0 2023-06-27 08:03:20,218 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1761192.0, ans=0.0 2023-06-27 08:03:24,547 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.013e+02 5.908e+02 6.994e+02 9.504e+02 2.053e+03, threshold=1.399e+03, percent-clipped=0.0 2023-06-27 08:03:46,847 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.75 vs. limit=15.0 2023-06-27 08:03:52,620 INFO [train.py:996] (0/4) Epoch 10, batch 19100, loss[loss=0.1976, simple_loss=0.2607, pruned_loss=0.06729, over 21192.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2878, pruned_loss=0.0686, over 4261269.21 frames. ], batch size: 176, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 08:03:58,843 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.28 vs. limit=12.0 2023-06-27 08:04:11,638 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.19 vs. 
limit=12.0 2023-06-27 08:04:18,019 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1761372.0, ans=0.05 2023-06-27 08:04:19,686 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1761372.0, ans=0.0 2023-06-27 08:05:29,001 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1761552.0, ans=0.2 2023-06-27 08:05:32,678 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1761552.0, ans=0.125 2023-06-27 08:05:42,414 INFO [train.py:996] (0/4) Epoch 10, batch 19150, loss[loss=0.2915, simple_loss=0.3844, pruned_loss=0.09926, over 21625.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2895, pruned_loss=0.06907, over 4260641.60 frames. ], batch size: 414, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 08:05:45,374 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1761612.0, ans=0.2 2023-06-27 08:05:56,859 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=10.44 vs. limit=15.0 2023-06-27 08:06:22,237 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1761672.0, ans=0.125 2023-06-27 08:06:29,429 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1761732.0, ans=0.05 2023-06-27 08:06:50,440 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 08:06:53,111 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.207e+02 6.138e+02 1.014e+03 1.599e+03 3.928e+03, threshold=2.029e+03, percent-clipped=32.0 2023-06-27 08:07:02,184 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1761852.0, ans=0.1 2023-06-27 08:07:03,740 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1761852.0, ans=0.1 2023-06-27 08:07:26,304 INFO [train.py:996] (0/4) Epoch 10, batch 19200, loss[loss=0.2522, simple_loss=0.3567, pruned_loss=0.07384, over 21931.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.3009, pruned_loss=0.07041, over 4265155.79 frames. 
], batch size: 372, lr: 2.92e-03, grad_scale: 32.0 2023-06-27 08:07:32,737 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1761912.0, ans=0.0 2023-06-27 08:07:42,826 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1761912.0, ans=0.125 2023-06-27 08:08:08,589 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1761972.0, ans=0.05 2023-06-27 08:09:06,731 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1762152.0, ans=0.125 2023-06-27 08:09:08,397 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 08:09:12,927 INFO [train.py:996] (0/4) Epoch 10, batch 19250, loss[loss=0.1684, simple_loss=0.2598, pruned_loss=0.03844, over 21444.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.3006, pruned_loss=0.06648, over 4260061.12 frames. ], batch size: 211, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 08:09:56,281 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1762332.0, ans=0.0 2023-06-27 08:10:13,869 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1762392.0, ans=0.5 2023-06-27 08:10:18,974 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1762392.0, ans=0.2 2023-06-27 08:10:23,407 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.711e+02 5.181e+02 6.655e+02 8.936e+02 1.845e+03, threshold=1.331e+03, percent-clipped=0.0 2023-06-27 08:10:59,467 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.69 vs. limit=12.0 2023-06-27 08:10:59,804 INFO [train.py:996] (0/4) Epoch 10, batch 19300, loss[loss=0.2265, simple_loss=0.2931, pruned_loss=0.07996, over 21940.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2968, pruned_loss=0.06619, over 4267600.94 frames. ], batch size: 113, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:11:12,972 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1762512.0, ans=0.2 2023-06-27 08:11:41,210 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1762572.0, ans=0.125 2023-06-27 08:12:38,138 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1762752.0, ans=0.1 2023-06-27 08:12:52,833 INFO [train.py:996] (0/4) Epoch 10, batch 19350, loss[loss=0.1998, simple_loss=0.2909, pruned_loss=0.05429, over 21736.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2913, pruned_loss=0.06327, over 4268389.87 frames. ], batch size: 391, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:13:26,301 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.80 vs. 
limit=12.0 2023-06-27 08:13:29,300 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1762872.0, ans=0.0 2023-06-27 08:13:32,613 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1762932.0, ans=0.125 2023-06-27 08:13:36,216 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1762932.0, ans=0.125 2023-06-27 08:13:39,693 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1762932.0, ans=0.1 2023-06-27 08:14:03,615 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.583e+02 5.604e+02 8.480e+02 1.112e+03 2.601e+03, threshold=1.696e+03, percent-clipped=20.0 2023-06-27 08:14:31,460 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1763052.0, ans=0.1 2023-06-27 08:14:39,190 INFO [train.py:996] (0/4) Epoch 10, batch 19400, loss[loss=0.1954, simple_loss=0.2645, pruned_loss=0.06313, over 21217.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2898, pruned_loss=0.06263, over 4279021.19 frames. ], batch size: 176, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:14:48,547 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1763112.0, ans=0.0 2023-06-27 08:15:26,191 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1763232.0, ans=0.125 2023-06-27 08:16:15,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1763352.0, ans=0.035 2023-06-27 08:16:23,183 INFO [train.py:996] (0/4) Epoch 10, batch 19450, loss[loss=0.2105, simple_loss=0.2923, pruned_loss=0.06432, over 21968.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2886, pruned_loss=0.06396, over 4286320.93 frames. ], batch size: 113, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:17:34,591 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.998e+02 5.344e+02 8.011e+02 1.240e+03 3.010e+03, threshold=1.602e+03, percent-clipped=14.0 2023-06-27 08:17:46,383 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.61 vs. limit=22.5 2023-06-27 08:17:56,416 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1763652.0, ans=0.125 2023-06-27 08:18:11,413 INFO [train.py:996] (0/4) Epoch 10, batch 19500, loss[loss=0.2045, simple_loss=0.2742, pruned_loss=0.06741, over 21257.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.283, pruned_loss=0.06401, over 4279192.61 frames. ], batch size: 548, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:18:19,416 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.57 vs. limit=15.0 2023-06-27 08:18:33,493 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.46 vs. 
limit=22.5 2023-06-27 08:19:31,951 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1763952.0, ans=0.125 2023-06-27 08:19:57,001 INFO [train.py:996] (0/4) Epoch 10, batch 19550, loss[loss=0.1813, simple_loss=0.2834, pruned_loss=0.03963, over 21776.00 frames. ], tot_loss[loss=0.2029, simple_loss=0.2801, pruned_loss=0.0628, over 4275031.00 frames. ], batch size: 282, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:20:17,312 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1764072.0, ans=0.1 2023-06-27 08:20:25,970 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1764072.0, ans=0.2 2023-06-27 08:20:26,541 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.88 vs. limit=12.0 2023-06-27 08:20:44,588 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1764132.0, ans=0.0 2023-06-27 08:20:53,135 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1764192.0, ans=0.0 2023-06-27 08:21:01,409 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.262e+02 6.617e+02 9.937e+02 1.346e+03 3.535e+03, threshold=1.987e+03, percent-clipped=18.0 2023-06-27 08:21:41,978 INFO [train.py:996] (0/4) Epoch 10, batch 19600, loss[loss=0.2074, simple_loss=0.2776, pruned_loss=0.06863, over 21640.00 frames. ], tot_loss[loss=0.2043, simple_loss=0.2818, pruned_loss=0.06339, over 4280942.38 frames. ], batch size: 263, lr: 2.91e-03, grad_scale: 32.0 2023-06-27 08:22:28,571 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.98 vs. limit=15.0 2023-06-27 08:23:30,482 INFO [train.py:996] (0/4) Epoch 10, batch 19650, loss[loss=0.2108, simple_loss=0.284, pruned_loss=0.06879, over 21829.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2869, pruned_loss=0.06716, over 4283159.88 frames. ], batch size: 298, lr: 2.91e-03, grad_scale: 32.0 2023-06-27 08:24:05,038 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1764732.0, ans=0.1 2023-06-27 08:24:24,255 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.23 vs. limit=15.0 2023-06-27 08:24:56,581 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.384e+02 6.290e+02 8.079e+02 1.063e+03 2.506e+03, threshold=1.616e+03, percent-clipped=1.0 2023-06-27 08:25:09,191 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.29 vs. limit=15.0 2023-06-27 08:25:22,637 INFO [train.py:996] (0/4) Epoch 10, batch 19700, loss[loss=0.129, simple_loss=0.1682, pruned_loss=0.04494, over 16434.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2912, pruned_loss=0.06715, over 4274774.96 frames. ], batch size: 61, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:25:36,182 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1764912.0, ans=0.035 2023-06-27 08:27:12,076 INFO [train.py:996] (0/4) Epoch 10, batch 19750, loss[loss=0.2583, simple_loss=0.3451, pruned_loss=0.08576, over 21776.00 frames. 
], tot_loss[loss=0.2175, simple_loss=0.2982, pruned_loss=0.06833, over 4281676.04 frames. ], batch size: 414, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:27:24,361 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1765212.0, ans=0.1 2023-06-27 08:27:36,855 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1765272.0, ans=0.0 2023-06-27 08:27:48,734 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1765272.0, ans=0.1 2023-06-27 08:28:19,397 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1765332.0, ans=0.0 2023-06-27 08:28:31,236 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1765392.0, ans=0.0 2023-06-27 08:28:33,897 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.148e+02 7.081e+02 1.435e+03 2.263e+03 4.438e+03, threshold=2.870e+03, percent-clipped=43.0 2023-06-27 08:28:58,230 INFO [train.py:996] (0/4) Epoch 10, batch 19800, loss[loss=0.1861, simple_loss=0.2649, pruned_loss=0.05367, over 21724.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2976, pruned_loss=0.06902, over 4289997.20 frames. ], batch size: 298, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:29:09,876 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=2.97 vs. limit=10.0 2023-06-27 08:29:22,570 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1765572.0, ans=0.1 2023-06-27 08:29:47,432 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1765572.0, ans=0.1 2023-06-27 08:29:59,604 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1765632.0, ans=0.125 2023-06-27 08:30:48,441 INFO [train.py:996] (0/4) Epoch 10, batch 19850, loss[loss=0.1659, simple_loss=0.2509, pruned_loss=0.04048, over 21642.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2921, pruned_loss=0.06469, over 4287731.94 frames. ], batch size: 230, lr: 2.91e-03, grad_scale: 8.0 2023-06-27 08:31:03,728 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1765812.0, ans=0.5 2023-06-27 08:32:05,978 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.17 vs. limit=15.0 2023-06-27 08:32:13,175 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.649e+02 5.631e+02 8.956e+02 1.493e+03 4.041e+03, threshold=1.791e+03, percent-clipped=3.0 2023-06-27 08:32:35,735 INFO [train.py:996] (0/4) Epoch 10, batch 19900, loss[loss=0.2188, simple_loss=0.2805, pruned_loss=0.07853, over 21319.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.292, pruned_loss=0.0633, over 4283817.11 frames. ], batch size: 471, lr: 2.91e-03, grad_scale: 8.0 2023-06-27 08:32:50,930 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.27 vs. 
limit=22.5 2023-06-27 08:33:58,781 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1766292.0, ans=0.0 2023-06-27 08:34:29,529 INFO [train.py:996] (0/4) Epoch 10, batch 19950, loss[loss=0.17, simple_loss=0.2412, pruned_loss=0.04939, over 21159.00 frames. ], tot_loss[loss=0.2063, simple_loss=0.2868, pruned_loss=0.06286, over 4279774.74 frames. ], batch size: 176, lr: 2.91e-03, grad_scale: 8.0 2023-06-27 08:35:23,188 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.41 vs. limit=12.0 2023-06-27 08:35:49,402 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.611e+02 4.941e+02 6.532e+02 1.016e+03 1.667e+03, threshold=1.306e+03, percent-clipped=0.0 2023-06-27 08:35:55,189 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1766652.0, ans=0.0 2023-06-27 08:36:21,396 INFO [train.py:996] (0/4) Epoch 10, batch 20000, loss[loss=0.2398, simple_loss=0.3224, pruned_loss=0.07859, over 21725.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.2869, pruned_loss=0.06326, over 4275027.65 frames. ], batch size: 414, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:36:30,926 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.91 vs. limit=15.0 2023-06-27 08:36:56,730 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1766772.0, ans=0.0 2023-06-27 08:37:42,949 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1766952.0, ans=0.0 2023-06-27 08:37:48,137 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1766952.0, ans=0.2 2023-06-27 08:37:53,121 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=1766952.0, ans=10.0 2023-06-27 08:38:03,073 INFO [train.py:996] (0/4) Epoch 10, batch 20050, loss[loss=0.2505, simple_loss=0.3134, pruned_loss=0.09378, over 21771.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.2883, pruned_loss=0.0658, over 4277967.41 frames. ], batch size: 441, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:38:15,738 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1767012.0, ans=0.125 2023-06-27 08:38:34,791 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.71 vs. limit=10.0 2023-06-27 08:38:36,124 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1767072.0, ans=0.125 2023-06-27 08:38:50,969 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.23 vs. limit=22.5 2023-06-27 08:39:18,082 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.352e+02 5.758e+02 8.028e+02 1.108e+03 2.385e+03, threshold=1.606e+03, percent-clipped=14.0 2023-06-27 08:39:56,999 INFO [train.py:996] (0/4) Epoch 10, batch 20100, loss[loss=0.3316, simple_loss=0.4136, pruned_loss=0.1249, over 21469.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2908, pruned_loss=0.06818, over 4281499.37 frames. 
], batch size: 507, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:40:47,997 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.75 vs. limit=15.0 2023-06-27 08:41:44,178 INFO [train.py:996] (0/4) Epoch 10, batch 20150, loss[loss=0.2445, simple_loss=0.3304, pruned_loss=0.07934, over 21813.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2992, pruned_loss=0.0707, over 4283995.64 frames. ], batch size: 124, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:43:08,916 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.695e+02 8.899e+02 1.364e+03 1.871e+03 4.503e+03, threshold=2.728e+03, percent-clipped=36.0 2023-06-27 08:43:31,395 INFO [train.py:996] (0/4) Epoch 10, batch 20200, loss[loss=0.2816, simple_loss=0.3732, pruned_loss=0.09495, over 21639.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.3042, pruned_loss=0.07263, over 4282789.02 frames. ], batch size: 441, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:43:44,969 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1767912.0, ans=0.0 2023-06-27 08:44:39,632 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1768092.0, ans=0.125 2023-06-27 08:44:45,685 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.13 vs. limit=6.0 2023-06-27 08:44:58,921 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1768092.0, ans=0.1 2023-06-27 08:45:19,244 INFO [train.py:996] (0/4) Epoch 10, batch 20250, loss[loss=0.2094, simple_loss=0.2832, pruned_loss=0.06775, over 21225.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.3049, pruned_loss=0.07172, over 4277704.70 frames. ], batch size: 143, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:46:01,172 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1768332.0, ans=0.0 2023-06-27 08:46:37,986 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.153e+02 5.975e+02 7.847e+02 1.054e+03 2.189e+03, threshold=1.569e+03, percent-clipped=0.0 2023-06-27 08:46:47,555 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.75 vs. limit=10.0 2023-06-27 08:46:59,749 INFO [train.py:996] (0/4) Epoch 10, batch 20300, loss[loss=0.1827, simple_loss=0.2581, pruned_loss=0.05361, over 21948.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.3031, pruned_loss=0.06967, over 4282574.27 frames. ], batch size: 107, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:47:25,523 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1768572.0, ans=0.125 2023-06-27 08:48:15,305 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1768692.0, ans=0.1 2023-06-27 08:48:20,992 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.96 vs. 
limit=12.0 2023-06-27 08:48:36,068 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1768752.0, ans=0.0 2023-06-27 08:48:40,510 INFO [train.py:996] (0/4) Epoch 10, batch 20350, loss[loss=0.2245, simple_loss=0.3019, pruned_loss=0.07355, over 21862.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.3018, pruned_loss=0.06924, over 4260125.29 frames. ], batch size: 118, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:48:54,030 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.19 vs. limit=15.0 2023-06-27 08:49:02,380 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1768872.0, ans=0.125 2023-06-27 08:50:07,281 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.767e+02 5.725e+02 9.452e+02 1.415e+03 2.531e+03, threshold=1.890e+03, percent-clipped=19.0 2023-06-27 08:50:29,305 INFO [train.py:996] (0/4) Epoch 10, batch 20400, loss[loss=0.2283, simple_loss=0.2993, pruned_loss=0.0787, over 21301.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.3035, pruned_loss=0.07112, over 4260662.13 frames. ], batch size: 159, lr: 2.91e-03, grad_scale: 32.0 2023-06-27 08:50:42,117 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1769112.0, ans=0.1 2023-06-27 08:50:45,895 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1769172.0, ans=0.0 2023-06-27 08:50:50,960 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1769172.0, ans=0.125 2023-06-27 08:51:52,246 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1769292.0, ans=0.0 2023-06-27 08:52:08,535 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 08:52:10,086 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1769352.0, ans=0.2 2023-06-27 08:52:12,685 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.67 vs. limit=22.5 2023-06-27 08:52:16,222 INFO [train.py:996] (0/4) Epoch 10, batch 20450, loss[loss=0.2099, simple_loss=0.2822, pruned_loss=0.06878, over 21816.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3052, pruned_loss=0.07364, over 4252859.88 frames. 
], batch size: 298, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:52:16,704 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1769412.0, ans=0.0 2023-06-27 08:52:58,034 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1769472.0, ans=0.125 2023-06-27 08:53:24,129 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1769532.0, ans=0.1 2023-06-27 08:53:41,169 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1769592.0, ans=0.125 2023-06-27 08:53:42,097 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.405e+02 6.127e+02 7.186e+02 1.014e+03 1.873e+03, threshold=1.437e+03, percent-clipped=1.0 2023-06-27 08:54:02,078 INFO [train.py:996] (0/4) Epoch 10, batch 20500, loss[loss=0.1958, simple_loss=0.267, pruned_loss=0.06224, over 21795.00 frames. ], tot_loss[loss=0.224, simple_loss=0.3014, pruned_loss=0.07329, over 4259865.50 frames. ], batch size: 351, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:55:20,441 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.15 vs. limit=15.0 2023-06-27 08:55:37,131 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1769952.0, ans=0.125 2023-06-27 08:55:48,845 INFO [train.py:996] (0/4) Epoch 10, batch 20550, loss[loss=0.211, simple_loss=0.3033, pruned_loss=0.05939, over 21660.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2945, pruned_loss=0.07173, over 4260025.90 frames. ], batch size: 332, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:55:58,318 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1770012.0, ans=0.0 2023-06-27 08:56:00,125 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1770012.0, ans=0.0 2023-06-27 08:56:20,815 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1770072.0, ans=0.125 2023-06-27 08:57:14,707 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.809e+02 5.326e+02 8.786e+02 1.599e+03 3.543e+03, threshold=1.757e+03, percent-clipped=26.0 2023-06-27 08:57:22,134 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1770252.0, ans=0.0 2023-06-27 08:57:34,958 INFO [train.py:996] (0/4) Epoch 10, batch 20600, loss[loss=0.1914, simple_loss=0.2603, pruned_loss=0.06132, over 21452.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2956, pruned_loss=0.07042, over 4261166.25 frames. ], batch size: 211, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:57:36,411 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.84 vs. limit=15.0 2023-06-27 08:57:54,377 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1770372.0, ans=0.0 2023-06-27 08:58:52,335 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.10 vs. 
limit=22.5 2023-06-27 08:59:17,192 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1770552.0, ans=0.125 2023-06-27 08:59:19,955 INFO [train.py:996] (0/4) Epoch 10, batch 20650, loss[loss=0.1985, simple_loss=0.2606, pruned_loss=0.06821, over 21356.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2926, pruned_loss=0.07077, over 4252550.56 frames. ], batch size: 131, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:59:57,928 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1770672.0, ans=0.125 2023-06-27 09:00:12,583 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.84 vs. limit=15.0 2023-06-27 09:00:45,103 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.379e+02 5.610e+02 8.426e+02 1.372e+03 2.943e+03, threshold=1.685e+03, percent-clipped=16.0 2023-06-27 09:00:51,441 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1770852.0, ans=0.0 2023-06-27 09:00:58,142 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1770852.0, ans=0.125 2023-06-27 09:01:06,348 INFO [train.py:996] (0/4) Epoch 10, batch 20700, loss[loss=0.2384, simple_loss=0.3418, pruned_loss=0.06752, over 21226.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2858, pruned_loss=0.06764, over 4253114.40 frames. ], batch size: 548, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 09:01:28,274 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1770972.0, ans=0.2 2023-06-27 09:01:44,953 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1770972.0, ans=0.125 2023-06-27 09:02:16,057 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1771092.0, ans=0.0 2023-06-27 09:02:22,057 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1771092.0, ans=0.125 2023-06-27 09:02:22,640 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.75 vs. limit=15.0 2023-06-27 09:02:51,370 INFO [train.py:996] (0/4) Epoch 10, batch 20750, loss[loss=0.1754, simple_loss=0.2273, pruned_loss=0.06178, over 20970.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2881, pruned_loss=0.06748, over 4243763.11 frames. ], batch size: 608, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 09:02:54,289 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.13 vs. limit=15.0 2023-06-27 09:03:19,293 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 09:03:38,620 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.63 vs. 
limit=15.0 2023-06-27 09:04:13,387 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.912e+02 7.283e+02 1.259e+03 1.890e+03 5.387e+03, threshold=2.519e+03, percent-clipped=32.0 2023-06-27 09:04:27,326 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.75 vs. limit=15.0 2023-06-27 09:04:39,264 INFO [train.py:996] (0/4) Epoch 10, batch 20800, loss[loss=0.1804, simple_loss=0.2488, pruned_loss=0.05598, over 21835.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2905, pruned_loss=0.06772, over 4241256.36 frames. ], batch size: 118, lr: 2.91e-03, grad_scale: 32.0 2023-06-27 09:06:20,407 INFO [train.py:996] (0/4) Epoch 10, batch 20850, loss[loss=0.1526, simple_loss=0.2221, pruned_loss=0.04153, over 21558.00 frames. ], tot_loss[loss=0.2091, simple_loss=0.2862, pruned_loss=0.06596, over 4239979.50 frames. ], batch size: 212, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 09:06:48,496 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=1771812.0, ans=0.5 2023-06-27 09:07:03,869 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1771872.0, ans=0.125 2023-06-27 09:07:26,373 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1771932.0, ans=0.0 2023-06-27 09:07:36,753 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1771992.0, ans=0.125 2023-06-27 09:07:48,088 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.047e+02 6.784e+02 1.035e+03 1.709e+03 3.199e+03, threshold=2.070e+03, percent-clipped=7.0 2023-06-27 09:07:51,061 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.62 vs. limit=22.5 2023-06-27 09:08:12,506 INFO [train.py:996] (0/4) Epoch 10, batch 20900, loss[loss=0.2007, simple_loss=0.276, pruned_loss=0.06265, over 21247.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2865, pruned_loss=0.06696, over 4255978.12 frames. ], batch size: 159, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 09:08:55,648 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1772172.0, ans=0.1 2023-06-27 09:09:07,970 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1772232.0, ans=0.125 2023-06-27 09:09:12,811 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 09:09:37,064 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.08 vs. limit=15.0 2023-06-27 09:09:41,453 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1772352.0, ans=0.2 2023-06-27 09:09:54,331 INFO [train.py:996] (0/4) Epoch 10, batch 20950, loss[loss=0.1947, simple_loss=0.2655, pruned_loss=0.06195, over 21605.00 frames. ], tot_loss[loss=0.2061, simple_loss=0.2831, pruned_loss=0.06453, over 4262260.77 frames. 
], batch size: 230, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 09:11:06,784 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1772592.0, ans=0.125 2023-06-27 09:11:10,125 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1772592.0, ans=0.0 2023-06-27 09:11:19,703 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.843e+02 5.854e+02 8.072e+02 1.179e+03 2.171e+03, threshold=1.614e+03, percent-clipped=1.0 2023-06-27 09:11:36,321 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.74 vs. limit=15.0 2023-06-27 09:11:38,294 INFO [train.py:996] (0/4) Epoch 10, batch 21000, loss[loss=0.2058, simple_loss=0.2912, pruned_loss=0.06016, over 21863.00 frames. ], tot_loss[loss=0.205, simple_loss=0.2813, pruned_loss=0.06428, over 4257788.43 frames. ], batch size: 316, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 09:11:38,295 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-27 09:12:02,870 INFO [train.py:1028] (0/4) Epoch 10, validation: loss=0.2606, simple_loss=0.3545, pruned_loss=0.08334, over 1796401.00 frames. 2023-06-27 09:12:02,871 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23714MB 2023-06-27 09:13:42,559 INFO [train.py:996] (0/4) Epoch 10, batch 21050, loss[loss=0.1585, simple_loss=0.2334, pruned_loss=0.04182, over 16399.00 frames. ], tot_loss[loss=0.2042, simple_loss=0.2794, pruned_loss=0.06454, over 4251306.40 frames. ], batch size: 63, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 09:13:44,760 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1773012.0, ans=0.0 2023-06-27 09:14:09,276 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1773072.0, ans=0.0 2023-06-27 09:14:32,786 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1773132.0, ans=0.125 2023-06-27 09:14:36,591 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.21 vs. limit=22.5 2023-06-27 09:14:59,143 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.065e+02 6.476e+02 8.191e+02 1.141e+03 2.345e+03, threshold=1.638e+03, percent-clipped=6.0 2023-06-27 09:15:23,651 INFO [train.py:996] (0/4) Epoch 10, batch 21100, loss[loss=0.1951, simple_loss=0.2679, pruned_loss=0.06117, over 21404.00 frames. ], tot_loss[loss=0.2024, simple_loss=0.2765, pruned_loss=0.06409, over 4245259.74 frames. ], batch size: 389, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 09:15:34,248 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1773312.0, ans=0.125 2023-06-27 09:16:07,506 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1773372.0, ans=0.125 2023-06-27 09:17:06,137 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1773552.0, ans=0.1 2023-06-27 09:17:08,855 INFO [train.py:996] (0/4) Epoch 10, batch 21150, loss[loss=0.1907, simple_loss=0.2557, pruned_loss=0.06282, over 21823.00 frames. ], tot_loss[loss=0.2005, simple_loss=0.2721, pruned_loss=0.06445, over 4252222.57 frames. 
], batch size: 352, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 09:18:05,513 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.57 vs. limit=12.0 2023-06-27 09:18:32,207 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1773792.0, ans=0.1 2023-06-27 09:18:33,089 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.032e+02 6.384e+02 8.623e+02 1.133e+03 2.526e+03, threshold=1.725e+03, percent-clipped=9.0 2023-06-27 09:18:45,926 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1773852.0, ans=0.125 2023-06-27 09:18:51,937 INFO [train.py:996] (0/4) Epoch 10, batch 21200, loss[loss=0.2062, simple_loss=0.267, pruned_loss=0.07269, over 21482.00 frames. ], tot_loss[loss=0.1976, simple_loss=0.2684, pruned_loss=0.06336, over 4250774.09 frames. ], batch size: 441, lr: 2.91e-03, grad_scale: 32.0 2023-06-27 09:19:58,580 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1774032.0, ans=0.0 2023-06-27 09:20:25,168 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1774152.0, ans=0.1 2023-06-27 09:20:32,152 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1774152.0, ans=0.125 2023-06-27 09:20:44,836 INFO [train.py:996] (0/4) Epoch 10, batch 21250, loss[loss=0.1957, simple_loss=0.2694, pruned_loss=0.06102, over 21618.00 frames. ], tot_loss[loss=0.1957, simple_loss=0.2662, pruned_loss=0.06257, over 4247980.70 frames. ], batch size: 263, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 09:21:16,056 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.08 vs. limit=22.5 2023-06-27 09:21:56,060 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1774392.0, ans=0.125 2023-06-27 09:22:00,986 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1774392.0, ans=0.125 2023-06-27 09:22:08,474 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.773e+02 6.524e+02 9.029e+02 1.391e+03 2.253e+03, threshold=1.806e+03, percent-clipped=10.0 2023-06-27 09:22:11,048 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1774452.0, ans=0.0 2023-06-27 09:22:25,325 INFO [train.py:996] (0/4) Epoch 10, batch 21300, loss[loss=0.229, simple_loss=0.3022, pruned_loss=0.07793, over 21910.00 frames. ], tot_loss[loss=0.2007, simple_loss=0.2721, pruned_loss=0.06469, over 4243305.56 frames. ], batch size: 107, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 09:22:38,149 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.65 vs. limit=15.0 2023-06-27 09:23:23,934 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.19 vs. limit=6.0 2023-06-27 09:23:34,102 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.51 vs. 
limit=12.0 2023-06-27 09:23:48,350 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.73 vs. limit=15.0 2023-06-27 09:24:13,056 INFO [train.py:996] (0/4) Epoch 10, batch 21350, loss[loss=0.1948, simple_loss=0.2692, pruned_loss=0.06018, over 21325.00 frames. ], tot_loss[loss=0.205, simple_loss=0.2777, pruned_loss=0.06616, over 4254762.23 frames. ], batch size: 176, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:24:54,197 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1774872.0, ans=0.025 2023-06-27 09:25:24,196 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1774992.0, ans=0.2 2023-06-27 09:25:33,054 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1774992.0, ans=0.2 2023-06-27 09:25:38,215 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1775052.0, ans=0.0 2023-06-27 09:25:39,166 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.868e+02 6.492e+02 8.824e+02 1.457e+03 2.432e+03, threshold=1.765e+03, percent-clipped=7.0 2023-06-27 09:26:01,040 INFO [train.py:996] (0/4) Epoch 10, batch 21400, loss[loss=0.2465, simple_loss=0.3258, pruned_loss=0.08359, over 21553.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.2802, pruned_loss=0.06542, over 4260469.11 frames. ], batch size: 414, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:26:11,385 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1775112.0, ans=0.0 2023-06-27 09:26:25,321 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1775112.0, ans=0.125 2023-06-27 09:26:29,430 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.50 vs. limit=10.0 2023-06-27 09:26:43,959 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1775172.0, ans=0.05 2023-06-27 09:27:15,777 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1775292.0, ans=0.95 2023-06-27 09:27:47,928 INFO [train.py:996] (0/4) Epoch 10, batch 21450, loss[loss=0.2292, simple_loss=0.3055, pruned_loss=0.07642, over 21851.00 frames. ], tot_loss[loss=0.2091, simple_loss=0.2845, pruned_loss=0.0669, over 4270047.47 frames. ], batch size: 118, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:28:24,351 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.49 vs. limit=15.0 2023-06-27 09:28:34,006 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1775532.0, ans=0.95 2023-06-27 09:28:51,534 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.38 vs. 
limit=12.0 2023-06-27 09:29:12,486 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.429e+02 6.380e+02 8.632e+02 1.324e+03 3.087e+03, threshold=1.726e+03, percent-clipped=6.0 2023-06-27 09:29:21,216 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1775652.0, ans=0.125 2023-06-27 09:29:39,668 INFO [train.py:996] (0/4) Epoch 10, batch 21500, loss[loss=0.1771, simple_loss=0.2475, pruned_loss=0.0534, over 21667.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2843, pruned_loss=0.06826, over 4267186.57 frames. ], batch size: 264, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:29:57,632 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1775712.0, ans=0.1 2023-06-27 09:30:09,587 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.77 vs. limit=15.0 2023-06-27 09:30:27,004 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1775832.0, ans=0.0 2023-06-27 09:31:06,115 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1775952.0, ans=0.04949747468305833 2023-06-27 09:31:08,892 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-296000.pt 2023-06-27 09:31:25,400 INFO [train.py:996] (0/4) Epoch 10, batch 21550, loss[loss=0.1736, simple_loss=0.2447, pruned_loss=0.05127, over 21730.00 frames. ], tot_loss[loss=0.2041, simple_loss=0.2775, pruned_loss=0.06536, over 4265152.37 frames. ], batch size: 282, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:31:42,575 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1776012.0, ans=0.125 2023-06-27 09:31:46,555 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.95 vs. limit=15.0 2023-06-27 09:32:18,883 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.68 vs. limit=15.0 2023-06-27 09:32:29,029 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 09:32:34,663 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1776192.0, ans=0.1 2023-06-27 09:32:45,485 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.045e+02 5.690e+02 8.515e+02 1.276e+03 3.905e+03, threshold=1.703e+03, percent-clipped=13.0 2023-06-27 09:33:20,026 INFO [train.py:996] (0/4) Epoch 10, batch 21600, loss[loss=0.1793, simple_loss=0.251, pruned_loss=0.05381, over 21779.00 frames. ], tot_loss[loss=0.2014, simple_loss=0.2746, pruned_loss=0.0641, over 4269472.18 frames. 
], batch size: 124, lr: 2.90e-03, grad_scale: 32.0 2023-06-27 09:33:57,770 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1776432.0, ans=0.125 2023-06-27 09:34:08,342 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1776432.0, ans=0.125 2023-06-27 09:34:13,394 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1776492.0, ans=0.0 2023-06-27 09:34:15,022 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1776492.0, ans=0.5 2023-06-27 09:34:31,624 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1776492.0, ans=0.0 2023-06-27 09:35:06,685 INFO [train.py:996] (0/4) Epoch 10, batch 21650, loss[loss=0.2809, simple_loss=0.3727, pruned_loss=0.09454, over 21470.00 frames. ], tot_loss[loss=0.2012, simple_loss=0.2779, pruned_loss=0.06228, over 4270464.10 frames. ], batch size: 507, lr: 2.90e-03, grad_scale: 32.0 2023-06-27 09:35:07,153 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1776612.0, ans=0.0 2023-06-27 09:35:24,193 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1776672.0, ans=0.125 2023-06-27 09:35:52,833 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1776732.0, ans=0.0 2023-06-27 09:35:58,348 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.59 vs. limit=15.0 2023-06-27 09:36:16,539 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1776792.0, ans=0.125 2023-06-27 09:36:26,713 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.628e+02 5.833e+02 8.995e+02 1.569e+03 2.622e+03, threshold=1.799e+03, percent-clipped=22.0 2023-06-27 09:36:53,195 INFO [train.py:996] (0/4) Epoch 10, batch 21700, loss[loss=0.194, simple_loss=0.264, pruned_loss=0.06197, over 21854.00 frames. ], tot_loss[loss=0.2005, simple_loss=0.279, pruned_loss=0.06104, over 4264026.78 frames. ], batch size: 98, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:36:58,810 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1776912.0, ans=0.1 2023-06-27 09:37:34,248 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1777032.0, ans=0.125 2023-06-27 09:37:44,469 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.96 vs. limit=15.0 2023-06-27 09:38:38,226 INFO [train.py:996] (0/4) Epoch 10, batch 21750, loss[loss=0.2003, simple_loss=0.2717, pruned_loss=0.06443, over 21825.00 frames. ], tot_loss[loss=0.1992, simple_loss=0.2752, pruned_loss=0.06164, over 4271069.75 frames. 
], batch size: 98, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:39:14,401 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1777332.0, ans=0.0 2023-06-27 09:39:19,207 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1777332.0, ans=0.0 2023-06-27 09:39:29,825 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1777392.0, ans=0.125 2023-06-27 09:39:31,930 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.44 vs. limit=15.0 2023-06-27 09:39:45,374 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1777392.0, ans=0.125 2023-06-27 09:39:58,332 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.215e+02 5.988e+02 7.907e+02 1.038e+03 1.862e+03, threshold=1.581e+03, percent-clipped=2.0 2023-06-27 09:40:05,988 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1777452.0, ans=0.2 2023-06-27 09:40:24,241 INFO [train.py:996] (0/4) Epoch 10, batch 21800, loss[loss=0.2497, simple_loss=0.3396, pruned_loss=0.07986, over 21823.00 frames. ], tot_loss[loss=0.1996, simple_loss=0.2741, pruned_loss=0.06255, over 4276043.46 frames. ], batch size: 352, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:40:43,855 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1777572.0, ans=0.1 2023-06-27 09:40:45,436 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1777572.0, ans=0.125 2023-06-27 09:40:57,816 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.85 vs. limit=15.0 2023-06-27 09:41:59,364 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.30 vs. limit=15.0 2023-06-27 09:42:10,453 INFO [train.py:996] (0/4) Epoch 10, batch 21850, loss[loss=0.2451, simple_loss=0.3436, pruned_loss=0.07329, over 21662.00 frames. ], tot_loss[loss=0.2036, simple_loss=0.2802, pruned_loss=0.0635, over 4258985.23 frames. ], batch size: 414, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:42:22,336 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.50 vs. 
limit=15.0 2023-06-27 09:42:26,917 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1777872.0, ans=0.125 2023-06-27 09:42:45,033 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1777932.0, ans=0.1 2023-06-27 09:42:51,955 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1777932.0, ans=0.2 2023-06-27 09:43:30,249 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.011e+02 6.825e+02 1.374e+03 1.718e+03 3.521e+03, threshold=2.747e+03, percent-clipped=39.0 2023-06-27 09:43:55,453 INFO [train.py:996] (0/4) Epoch 10, batch 21900, loss[loss=0.1726, simple_loss=0.2442, pruned_loss=0.05047, over 21837.00 frames. ], tot_loss[loss=0.204, simple_loss=0.2788, pruned_loss=0.06453, over 4267662.55 frames. ], batch size: 107, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:43:59,575 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1778112.0, ans=0.125 2023-06-27 09:44:12,106 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1778172.0, ans=0.0 2023-06-27 09:44:15,742 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1778172.0, ans=0.0 2023-06-27 09:44:24,176 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.43 vs. limit=15.0 2023-06-27 09:45:07,132 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1778292.0, ans=0.0 2023-06-27 09:45:40,211 INFO [train.py:996] (0/4) Epoch 10, batch 21950, loss[loss=0.1926, simple_loss=0.2718, pruned_loss=0.05664, over 21494.00 frames. ], tot_loss[loss=0.2003, simple_loss=0.2748, pruned_loss=0.06294, over 4269242.17 frames. ], batch size: 509, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:47:06,212 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.644e+02 5.395e+02 6.536e+02 9.589e+02 2.193e+03, threshold=1.307e+03, percent-clipped=0.0 2023-06-27 09:47:26,911 INFO [train.py:996] (0/4) Epoch 10, batch 22000, loss[loss=0.2046, simple_loss=0.2929, pruned_loss=0.05818, over 21211.00 frames. ], tot_loss[loss=0.196, simple_loss=0.2701, pruned_loss=0.06092, over 4273162.09 frames. ], batch size: 549, lr: 2.90e-03, grad_scale: 32.0 2023-06-27 09:47:28,337 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.41 vs. limit=12.0 2023-06-27 09:48:43,519 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1778892.0, ans=0.125 2023-06-27 09:48:44,163 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.30 vs. limit=15.0 2023-06-27 09:49:18,765 INFO [train.py:996] (0/4) Epoch 10, batch 22050, loss[loss=0.2754, simple_loss=0.3489, pruned_loss=0.101, over 21354.00 frames. ], tot_loss[loss=0.201, simple_loss=0.2763, pruned_loss=0.06283, over 4269910.30 frames. 
], batch size: 143, lr: 2.90e-03, grad_scale: 32.0 2023-06-27 09:49:21,179 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1779012.0, ans=0.0 2023-06-27 09:49:24,903 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.71 vs. limit=15.0 2023-06-27 09:50:14,781 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1779132.0, ans=0.1 2023-06-27 09:50:51,717 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.283e+02 7.142e+02 9.080e+02 1.741e+03 3.538e+03, threshold=1.816e+03, percent-clipped=36.0 2023-06-27 09:50:57,145 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1779252.0, ans=0.1 2023-06-27 09:51:05,034 INFO [train.py:996] (0/4) Epoch 10, batch 22100, loss[loss=0.2352, simple_loss=0.3083, pruned_loss=0.08103, over 21787.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2851, pruned_loss=0.0669, over 4275118.33 frames. ], batch size: 332, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:51:44,823 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1779432.0, ans=0.125 2023-06-27 09:52:27,822 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1779492.0, ans=0.1 2023-06-27 09:52:36,787 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=1779552.0, ans=6.0 2023-06-27 09:52:51,248 INFO [train.py:996] (0/4) Epoch 10, batch 22150, loss[loss=0.2097, simple_loss=0.2866, pruned_loss=0.0664, over 21910.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2878, pruned_loss=0.06826, over 4284063.69 frames. ], batch size: 415, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:52:59,499 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.97 vs. limit=15.0 2023-06-27 09:53:02,991 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.60 vs. limit=15.0 2023-06-27 09:53:04,198 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 09:53:12,764 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1779672.0, ans=0.05 2023-06-27 09:53:38,707 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1779732.0, ans=0.2 2023-06-27 09:54:25,384 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.248e+02 5.661e+02 7.475e+02 1.093e+03 2.487e+03, threshold=1.495e+03, percent-clipped=9.0 2023-06-27 09:54:39,194 INFO [train.py:996] (0/4) Epoch 10, batch 22200, loss[loss=0.2295, simple_loss=0.3137, pruned_loss=0.07262, over 21422.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2894, pruned_loss=0.06936, over 4292530.64 frames. 
], batch size: 548, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:55:09,655 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1779972.0, ans=0.125 2023-06-27 09:56:27,604 INFO [train.py:996] (0/4) Epoch 10, batch 22250, loss[loss=0.2353, simple_loss=0.3083, pruned_loss=0.08111, over 21467.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2959, pruned_loss=0.07039, over 4288406.06 frames. ], batch size: 211, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:56:31,647 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1780212.0, ans=0.0 2023-06-27 09:56:59,370 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1780272.0, ans=10.0 2023-06-27 09:57:01,999 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.57 vs. limit=22.5 2023-06-27 09:57:02,941 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1780272.0, ans=0.0 2023-06-27 09:57:37,882 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.74 vs. limit=15.0 2023-06-27 09:57:58,532 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.450e+02 6.092e+02 1.077e+03 1.479e+03 2.486e+03, threshold=2.154e+03, percent-clipped=24.0 2023-06-27 09:58:12,286 INFO [train.py:996] (0/4) Epoch 10, batch 22300, loss[loss=0.2656, simple_loss=0.3179, pruned_loss=0.1066, over 21635.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2971, pruned_loss=0.07198, over 4280520.04 frames. ], batch size: 471, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:58:50,739 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.54 vs. limit=22.5 2023-06-27 09:59:05,810 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.70 vs. limit=15.0 2023-06-27 09:59:25,918 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1780692.0, ans=0.0 2023-06-27 09:59:58,434 INFO [train.py:996] (0/4) Epoch 10, batch 22350, loss[loss=0.2271, simple_loss=0.3065, pruned_loss=0.07381, over 21610.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2956, pruned_loss=0.07255, over 4295178.79 frames. ], batch size: 473, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 10:01:31,000 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1781052.0, ans=0.125 2023-06-27 10:01:32,057 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.877e+02 5.424e+02 7.099e+02 9.642e+02 1.783e+03, threshold=1.420e+03, percent-clipped=0.0 2023-06-27 10:01:33,608 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.59 vs. limit=15.0 2023-06-27 10:01:45,473 INFO [train.py:996] (0/4) Epoch 10, batch 22400, loss[loss=0.1934, simple_loss=0.2703, pruned_loss=0.05827, over 21603.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2926, pruned_loss=0.07006, over 4289621.74 frames. 
], batch size: 263, lr: 2.90e-03, grad_scale: 32.0 2023-06-27 10:01:47,653 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1781112.0, ans=0.125 2023-06-27 10:01:56,027 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1781112.0, ans=0.125 2023-06-27 10:02:08,315 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.27 vs. limit=15.0 2023-06-27 10:02:36,616 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1781232.0, ans=0.125 2023-06-27 10:02:51,035 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1781232.0, ans=0.0 2023-06-27 10:03:15,081 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1781352.0, ans=0.125 2023-06-27 10:03:31,233 INFO [train.py:996] (0/4) Epoch 10, batch 22450, loss[loss=0.1962, simple_loss=0.2586, pruned_loss=0.06686, over 21331.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.2868, pruned_loss=0.06892, over 4286178.88 frames. ], batch size: 144, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 10:03:46,851 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1781412.0, ans=0.125 2023-06-27 10:04:38,728 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1781532.0, ans=0.1 2023-06-27 10:05:07,722 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.016e+02 6.073e+02 8.479e+02 1.179e+03 3.261e+03, threshold=1.696e+03, percent-clipped=18.0 2023-06-27 10:05:18,410 INFO [train.py:996] (0/4) Epoch 10, batch 22500, loss[loss=0.2163, simple_loss=0.3075, pruned_loss=0.06257, over 21234.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2833, pruned_loss=0.06877, over 4269881.10 frames. ], batch size: 159, lr: 2.90e-03, grad_scale: 8.0 2023-06-27 10:05:32,976 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1781712.0, ans=0.1 2023-06-27 10:06:36,203 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.65 vs. limit=15.0 2023-06-27 10:06:53,318 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1781952.0, ans=0.0 2023-06-27 10:07:06,834 INFO [train.py:996] (0/4) Epoch 10, batch 22550, loss[loss=0.1658, simple_loss=0.218, pruned_loss=0.05677, over 20736.00 frames. ], tot_loss[loss=0.212, simple_loss=0.2859, pruned_loss=0.06908, over 4278381.62 frames. ], batch size: 607, lr: 2.90e-03, grad_scale: 8.0 2023-06-27 10:08:19,817 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1782192.0, ans=0.125 2023-06-27 10:08:41,501 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.419e+02 6.178e+02 1.242e+03 1.950e+03 4.739e+03, threshold=2.485e+03, percent-clipped=29.0 2023-06-27 10:08:51,931 INFO [train.py:996] (0/4) Epoch 10, batch 22600, loss[loss=0.2337, simple_loss=0.3109, pruned_loss=0.07824, over 21746.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2888, pruned_loss=0.06922, over 4281867.52 frames. 
], batch size: 351, lr: 2.90e-03, grad_scale: 8.0 2023-06-27 10:08:56,333 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.45 vs. limit=15.0 2023-06-27 10:09:20,224 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1782312.0, ans=0.0 2023-06-27 10:09:36,118 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1782372.0, ans=0.0 2023-06-27 10:09:56,926 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1782432.0, ans=0.035 2023-06-27 10:09:58,718 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1782492.0, ans=0.0 2023-06-27 10:10:17,358 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1782552.0, ans=0.0 2023-06-27 10:10:27,680 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1782552.0, ans=0.0 2023-06-27 10:10:38,568 INFO [train.py:996] (0/4) Epoch 10, batch 22650, loss[loss=0.1814, simple_loss=0.2512, pruned_loss=0.05579, over 21981.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2847, pruned_loss=0.06868, over 4269782.14 frames. ], batch size: 103, lr: 2.90e-03, grad_scale: 8.0 2023-06-27 10:12:16,226 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.186e+02 6.070e+02 1.000e+03 1.313e+03 3.118e+03, threshold=2.001e+03, percent-clipped=3.0 2023-06-27 10:12:26,409 INFO [train.py:996] (0/4) Epoch 10, batch 22700, loss[loss=0.1919, simple_loss=0.253, pruned_loss=0.06539, over 21198.00 frames. ], tot_loss[loss=0.2085, simple_loss=0.2798, pruned_loss=0.0686, over 4256200.61 frames. ], batch size: 176, lr: 2.90e-03, grad_scale: 8.0 2023-06-27 10:13:01,286 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1782972.0, ans=0.5 2023-06-27 10:13:06,965 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.61 vs. limit=12.0 2023-06-27 10:13:24,019 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1783032.0, ans=0.125 2023-06-27 10:13:42,733 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1783092.0, ans=0.125 2023-06-27 10:13:51,050 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1783152.0, ans=0.2 2023-06-27 10:13:53,257 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.54 vs. limit=22.5 2023-06-27 10:14:12,783 INFO [train.py:996] (0/4) Epoch 10, batch 22750, loss[loss=0.2485, simple_loss=0.3279, pruned_loss=0.0846, over 21501.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2829, pruned_loss=0.06931, over 4259183.37 frames. 
], batch size: 131, lr: 2.90e-03, grad_scale: 8.0 2023-06-27 10:14:49,652 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1783272.0, ans=0.0 2023-06-27 10:14:58,245 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1783272.0, ans=0.0 2023-06-27 10:15:08,504 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1783332.0, ans=0.125 2023-06-27 10:15:11,943 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1783332.0, ans=0.2 2023-06-27 10:15:17,106 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1783332.0, ans=10.0 2023-06-27 10:15:20,720 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1783392.0, ans=0.125 2023-06-27 10:15:41,421 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=1783452.0, ans=15.0 2023-06-27 10:15:48,221 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.913e+02 6.203e+02 1.029e+03 1.531e+03 3.011e+03, threshold=2.057e+03, percent-clipped=6.0 2023-06-27 10:15:53,338 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.61 vs. limit=12.0 2023-06-27 10:16:04,101 INFO [train.py:996] (0/4) Epoch 10, batch 22800, loss[loss=0.211, simple_loss=0.2868, pruned_loss=0.06767, over 21853.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2868, pruned_loss=0.07163, over 4265945.01 frames. ], batch size: 414, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 10:16:51,296 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.30 vs. limit=15.0 2023-06-27 10:17:32,185 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.67 vs. limit=15.0 2023-06-27 10:17:44,772 INFO [train.py:996] (0/4) Epoch 10, batch 22850, loss[loss=0.1858, simple_loss=0.2521, pruned_loss=0.05973, over 21674.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.2826, pruned_loss=0.07052, over 4276542.32 frames. ], batch size: 299, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 10:18:34,927 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1783932.0, ans=0.0 2023-06-27 10:18:59,747 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1783992.0, ans=0.09899494936611666 2023-06-27 10:19:23,325 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.547e+02 9.815e+02 1.470e+03 2.221e+03 4.175e+03, threshold=2.939e+03, percent-clipped=31.0 2023-06-27 10:19:44,589 INFO [train.py:996] (0/4) Epoch 10, batch 22900, loss[loss=0.196, simple_loss=0.2989, pruned_loss=0.04659, over 21541.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.283, pruned_loss=0.0698, over 4278033.28 frames. 
], batch size: 230, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 10:20:05,356 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1784172.0, ans=0.0 2023-06-27 10:21:31,355 INFO [train.py:996] (0/4) Epoch 10, batch 22950, loss[loss=0.2369, simple_loss=0.369, pruned_loss=0.05245, over 21278.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2949, pruned_loss=0.06897, over 4277621.12 frames. ], batch size: 548, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 10:22:29,716 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1784592.0, ans=0.125 2023-06-27 10:22:51,598 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.863e+02 5.875e+02 8.793e+02 1.271e+03 3.173e+03, threshold=1.759e+03, percent-clipped=4.0 2023-06-27 10:23:04,586 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1784712.0, ans=0.125 2023-06-27 10:23:05,430 INFO [train.py:996] (0/4) Epoch 10, batch 23000, loss[loss=0.2187, simple_loss=0.2898, pruned_loss=0.07385, over 21509.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2967, pruned_loss=0.06695, over 4278543.04 frames. ], batch size: 548, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 10:23:11,747 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.78 vs. limit=22.5 2023-06-27 10:23:29,466 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.66 vs. limit=15.0 2023-06-27 10:23:47,049 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1784832.0, ans=0.1 2023-06-27 10:23:55,214 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1784832.0, ans=0.2 2023-06-27 10:24:11,293 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1784892.0, ans=0.1 2023-06-27 10:24:34,152 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1784952.0, ans=0.2 2023-06-27 10:24:46,694 INFO [train.py:996] (0/4) Epoch 10, batch 23050, loss[loss=0.293, simple_loss=0.3493, pruned_loss=0.1184, over 21358.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2974, pruned_loss=0.06878, over 4284420.00 frames. ], batch size: 508, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 10:25:36,358 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.47 vs. limit=6.0 2023-06-27 10:25:49,848 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1785192.0, ans=0.0 2023-06-27 10:25:54,696 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1785192.0, ans=0.0 2023-06-27 10:26:17,102 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.088e+02 5.488e+02 7.273e+02 1.121e+03 2.826e+03, threshold=1.455e+03, percent-clipped=6.0 2023-06-27 10:26:26,915 INFO [train.py:996] (0/4) Epoch 10, batch 23100, loss[loss=0.1874, simple_loss=0.2511, pruned_loss=0.06181, over 21244.00 frames. 
], tot_loss[loss=0.2151, simple_loss=0.2926, pruned_loss=0.06877, over 4282279.93 frames. ], batch size: 144, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 10:27:13,130 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.50 vs. limit=6.0 2023-06-27 10:27:22,408 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1785492.0, ans=0.0 2023-06-27 10:27:30,963 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.95 vs. limit=15.0 2023-06-27 10:28:02,058 INFO [train.py:996] (0/4) Epoch 10, batch 23150, loss[loss=0.2365, simple_loss=0.2974, pruned_loss=0.08779, over 21779.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2873, pruned_loss=0.06872, over 4281673.91 frames. ], batch size: 441, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 10:29:21,503 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1785852.0, ans=0.125 2023-06-27 10:29:25,878 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.152e+02 5.952e+02 7.532e+02 1.121e+03 2.900e+03, threshold=1.506e+03, percent-clipped=14.0 2023-06-27 10:29:35,447 INFO [train.py:996] (0/4) Epoch 10, batch 23200, loss[loss=0.1997, simple_loss=0.2681, pruned_loss=0.06562, over 21862.00 frames. ], tot_loss[loss=0.212, simple_loss=0.2859, pruned_loss=0.06902, over 4285311.86 frames. ], batch size: 247, lr: 2.90e-03, grad_scale: 32.0 2023-06-27 10:30:05,528 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.16 vs. limit=10.0 2023-06-27 10:30:06,899 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1786032.0, ans=10.0 2023-06-27 10:31:10,863 INFO [train.py:996] (0/4) Epoch 10, batch 23250, loss[loss=0.2256, simple_loss=0.2924, pruned_loss=0.07934, over 21748.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2866, pruned_loss=0.07043, over 4287371.72 frames. ], batch size: 389, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 10:31:28,027 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1786272.0, ans=0.2 2023-06-27 10:31:32,948 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1786272.0, ans=0.125 2023-06-27 10:31:36,440 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1786272.0, ans=0.125 2023-06-27 10:32:00,777 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1786332.0, ans=0.125 2023-06-27 10:32:44,613 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.481e+02 7.308e+02 1.025e+03 1.554e+03 3.146e+03, threshold=2.050e+03, percent-clipped=26.0 2023-06-27 10:32:52,912 INFO [train.py:996] (0/4) Epoch 10, batch 23300, loss[loss=0.2549, simple_loss=0.3514, pruned_loss=0.07919, over 21715.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2935, pruned_loss=0.07131, over 4290558.65 frames. 
], batch size: 441, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 10:34:18,114 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1786752.0, ans=0.125 2023-06-27 10:34:33,874 INFO [train.py:996] (0/4) Epoch 10, batch 23350, loss[loss=0.2406, simple_loss=0.3283, pruned_loss=0.07642, over 21309.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.297, pruned_loss=0.0704, over 4288720.56 frames. ], batch size: 549, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 10:34:37,719 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1786812.0, ans=0.125 2023-06-27 10:34:42,553 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1786812.0, ans=0.0 2023-06-27 10:35:00,044 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1786872.0, ans=0.2 2023-06-27 10:35:11,381 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1786872.0, ans=0.125 2023-06-27 10:35:28,802 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1786932.0, ans=0.2 2023-06-27 10:36:05,515 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.446e+02 7.072e+02 1.049e+03 1.355e+03 2.858e+03, threshold=2.098e+03, percent-clipped=8.0 2023-06-27 10:36:13,599 INFO [train.py:996] (0/4) Epoch 10, batch 23400, loss[loss=0.2206, simple_loss=0.289, pruned_loss=0.07615, over 21545.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2911, pruned_loss=0.06723, over 4290483.83 frames. ], batch size: 548, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 10:36:16,011 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 10:37:35,715 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1787292.0, ans=0.04949747468305833 2023-06-27 10:37:37,307 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1787352.0, ans=0.025 2023-06-27 10:37:52,176 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1787352.0, ans=0.0 2023-06-27 10:37:54,847 INFO [train.py:996] (0/4) Epoch 10, batch 23450, loss[loss=0.2447, simple_loss=0.3104, pruned_loss=0.08945, over 21849.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2911, pruned_loss=0.06853, over 4287887.19 frames. ], batch size: 247, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 10:38:20,679 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1787472.0, ans=0.125 2023-06-27 10:39:09,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1787592.0, ans=0.07 2023-06-27 10:39:25,537 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.955e+02 6.626e+02 1.004e+03 1.261e+03 2.377e+03, threshold=2.009e+03, percent-clipped=2.0 2023-06-27 10:39:38,048 INFO [train.py:996] (0/4) Epoch 10, batch 23500, loss[loss=0.2137, simple_loss=0.2778, pruned_loss=0.07484, over 21472.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2914, pruned_loss=0.07021, over 4291854.42 frames. 
], batch size: 194, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 10:39:44,048 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1787712.0, ans=0.125 2023-06-27 10:40:46,337 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.37 vs. limit=12.0 2023-06-27 10:41:08,454 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1787952.0, ans=0.2 2023-06-27 10:41:16,222 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1788012.0, ans=0.0 2023-06-27 10:41:17,176 INFO [train.py:996] (0/4) Epoch 10, batch 23550, loss[loss=0.2074, simple_loss=0.267, pruned_loss=0.0739, over 21196.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.287, pruned_loss=0.07003, over 4293467.14 frames. ], batch size: 143, lr: 2.89e-03, grad_scale: 8.0 2023-06-27 10:41:20,083 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.06 vs. limit=15.0 2023-06-27 10:41:31,045 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1788012.0, ans=0.1 2023-06-27 10:41:48,883 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1788072.0, ans=0.125 2023-06-27 10:42:47,401 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.468e+02 6.896e+02 9.643e+02 1.434e+03 2.789e+03, threshold=1.929e+03, percent-clipped=7.0 2023-06-27 10:42:55,962 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1788252.0, ans=0.0 2023-06-27 10:42:58,648 INFO [train.py:996] (0/4) Epoch 10, batch 23600, loss[loss=0.2532, simple_loss=0.3279, pruned_loss=0.08924, over 21507.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2873, pruned_loss=0.07009, over 4295645.68 frames. ], batch size: 194, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 10:43:09,383 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.20 vs. limit=15.0 2023-06-27 10:44:05,552 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1788492.0, ans=0.1 2023-06-27 10:44:20,763 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.58 vs. limit=15.0 2023-06-27 10:44:44,929 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1788552.0, ans=0.125 2023-06-27 10:44:47,507 INFO [train.py:996] (0/4) Epoch 10, batch 23650, loss[loss=0.2132, simple_loss=0.2899, pruned_loss=0.06831, over 21456.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2856, pruned_loss=0.06833, over 4283839.14 frames. 
], batch size: 194, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 10:45:12,928 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1788672.0, ans=0.1 2023-06-27 10:45:32,960 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1788732.0, ans=0.125 2023-06-27 10:46:03,675 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1788792.0, ans=0.125 2023-06-27 10:46:27,253 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.326e+02 5.698e+02 8.154e+02 1.096e+03 2.339e+03, threshold=1.631e+03, percent-clipped=3.0 2023-06-27 10:46:38,534 INFO [train.py:996] (0/4) Epoch 10, batch 23700, loss[loss=0.2765, simple_loss=0.3362, pruned_loss=0.1084, over 21399.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2895, pruned_loss=0.06851, over 4282925.86 frames. ], batch size: 509, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 10:47:00,537 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1788972.0, ans=0.125 2023-06-27 10:47:42,492 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1789092.0, ans=0.125 2023-06-27 10:48:19,722 INFO [train.py:996] (0/4) Epoch 10, batch 23750, loss[loss=0.1814, simple_loss=0.2712, pruned_loss=0.04577, over 21324.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2917, pruned_loss=0.06854, over 4281701.02 frames. ], batch size: 176, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 10:48:29,288 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.94 vs. limit=12.0 2023-06-27 10:48:33,952 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.65 vs. limit=15.0 2023-06-27 10:49:04,841 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1789332.0, ans=0.0 2023-06-27 10:49:16,164 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1789332.0, ans=0.04949747468305833 2023-06-27 10:49:44,276 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1789452.0, ans=0.125 2023-06-27 10:49:55,491 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.670e+02 6.081e+02 7.830e+02 1.141e+03 2.559e+03, threshold=1.566e+03, percent-clipped=8.0 2023-06-27 10:50:02,087 INFO [train.py:996] (0/4) Epoch 10, batch 23800, loss[loss=0.2126, simple_loss=0.3122, pruned_loss=0.05648, over 21787.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2908, pruned_loss=0.06669, over 4274168.99 frames. ], batch size: 316, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 10:50:11,566 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.80 vs. limit=10.0 2023-06-27 10:50:49,719 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.38 vs. limit=15.0 2023-06-27 10:51:45,227 INFO [train.py:996] (0/4) Epoch 10, batch 23850, loss[loss=0.2152, simple_loss=0.2977, pruned_loss=0.06633, over 21469.00 frames. 
], tot_loss[loss=0.2196, simple_loss=0.3018, pruned_loss=0.06875, over 4265317.20 frames. ], batch size: 211, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 10:52:07,638 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1789872.0, ans=0.2 2023-06-27 10:53:06,266 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1789992.0, ans=0.125 2023-06-27 10:53:14,273 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1790052.0, ans=0.2 2023-06-27 10:53:18,318 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.153e+02 6.860e+02 1.142e+03 1.790e+03 3.579e+03, threshold=2.285e+03, percent-clipped=29.0 2023-06-27 10:53:24,733 INFO [train.py:996] (0/4) Epoch 10, batch 23900, loss[loss=0.2277, simple_loss=0.325, pruned_loss=0.06523, over 21694.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.3066, pruned_loss=0.0706, over 4270789.20 frames. ], batch size: 332, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 10:54:06,629 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1790172.0, ans=0.2 2023-06-27 10:54:36,324 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1790292.0, ans=0.0 2023-06-27 10:55:05,881 INFO [train.py:996] (0/4) Epoch 10, batch 23950, loss[loss=0.1912, simple_loss=0.2544, pruned_loss=0.06396, over 21364.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.3028, pruned_loss=0.07092, over 4276294.71 frames. ], batch size: 194, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 10:55:08,387 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1790412.0, ans=0.0 2023-06-27 10:55:11,894 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1790412.0, ans=0.125 2023-06-27 10:56:09,744 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1790532.0, ans=0.125 2023-06-27 10:56:40,582 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.159e+02 7.309e+02 9.584e+02 1.406e+03 2.703e+03, threshold=1.917e+03, percent-clipped=3.0 2023-06-27 10:56:47,121 INFO [train.py:996] (0/4) Epoch 10, batch 24000, loss[loss=0.259, simple_loss=0.327, pruned_loss=0.09551, over 21576.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3042, pruned_loss=0.07378, over 4276701.91 frames. ], batch size: 415, lr: 2.89e-03, grad_scale: 32.0 2023-06-27 10:56:47,122 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-27 10:57:07,140 INFO [train.py:1028] (0/4) Epoch 10, validation: loss=0.2621, simple_loss=0.3549, pruned_loss=0.08461, over 1796401.00 frames. 2023-06-27 10:57:07,142 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23714MB 2023-06-27 10:58:45,462 INFO [train.py:996] (0/4) Epoch 10, batch 24050, loss[loss=0.1955, simple_loss=0.2887, pruned_loss=0.05118, over 21866.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3057, pruned_loss=0.07459, over 4281362.37 frames. ], batch size: 316, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 10:59:29,751 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.97 vs. 
limit=15.0 2023-06-27 11:00:21,995 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.514e+02 5.749e+02 8.023e+02 1.325e+03 2.806e+03, threshold=1.605e+03, percent-clipped=11.0 2023-06-27 11:00:32,045 INFO [train.py:996] (0/4) Epoch 10, batch 24100, loss[loss=0.2544, simple_loss=0.3392, pruned_loss=0.08475, over 21576.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3037, pruned_loss=0.07255, over 4275340.70 frames. ], batch size: 414, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:01:13,455 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1791432.0, ans=0.0 2023-06-27 11:01:16,426 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1791432.0, ans=0.125 2023-06-27 11:01:22,084 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.54 vs. limit=22.5 2023-06-27 11:01:41,436 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1791552.0, ans=0.125 2023-06-27 11:02:13,403 INFO [train.py:996] (0/4) Epoch 10, batch 24150, loss[loss=0.2647, simple_loss=0.3115, pruned_loss=0.1089, over 21777.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.3025, pruned_loss=0.07345, over 4280222.56 frames. ], batch size: 508, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:02:17,397 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1791612.0, ans=0.025 2023-06-27 11:02:37,686 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.40 vs. limit=15.0 2023-06-27 11:02:56,591 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1791732.0, ans=0.1 2023-06-27 11:03:33,867 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.52 vs. limit=10.0 2023-06-27 11:03:45,388 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.503e+02 6.384e+02 9.147e+02 1.297e+03 2.622e+03, threshold=1.829e+03, percent-clipped=12.0 2023-06-27 11:03:49,545 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1791912.0, ans=0.0 2023-06-27 11:03:50,483 INFO [train.py:996] (0/4) Epoch 10, batch 24200, loss[loss=0.2432, simple_loss=0.3275, pruned_loss=0.07945, over 21812.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.3051, pruned_loss=0.07493, over 4282299.46 frames. ], batch size: 333, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:03:51,289 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1791912.0, ans=0.125 2023-06-27 11:04:19,411 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1791972.0, ans=0.125 2023-06-27 11:05:33,111 INFO [train.py:996] (0/4) Epoch 10, batch 24250, loss[loss=0.1819, simple_loss=0.2825, pruned_loss=0.04067, over 21838.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.3022, pruned_loss=0.06968, over 4280982.60 frames. 
], batch size: 316, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:05:41,854 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1792212.0, ans=0.125 2023-06-27 11:05:45,409 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1792212.0, ans=0.125 2023-06-27 11:06:11,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1792332.0, ans=0.125 2023-06-27 11:06:13,125 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1792332.0, ans=0.125 2023-06-27 11:06:32,496 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1792332.0, ans=0.2 2023-06-27 11:06:45,661 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1792392.0, ans=0.1 2023-06-27 11:07:09,242 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.373e+02 5.833e+02 9.026e+02 1.321e+03 2.992e+03, threshold=1.805e+03, percent-clipped=10.0 2023-06-27 11:07:14,010 INFO [train.py:996] (0/4) Epoch 10, batch 24300, loss[loss=0.1383, simple_loss=0.2221, pruned_loss=0.02729, over 21615.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2964, pruned_loss=0.06503, over 4274539.76 frames. ], batch size: 230, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:07:14,684 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1792512.0, ans=0.125 2023-06-27 11:07:27,727 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1792512.0, ans=0.125 2023-06-27 11:08:55,419 INFO [train.py:996] (0/4) Epoch 10, batch 24350, loss[loss=0.2314, simple_loss=0.3061, pruned_loss=0.07831, over 21483.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2947, pruned_loss=0.06518, over 4282367.16 frames. ], batch size: 548, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:09:12,950 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1792872.0, ans=0.125 2023-06-27 11:10:21,647 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1793052.0, ans=0.2 2023-06-27 11:10:27,622 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.762e+02 6.342e+02 9.950e+02 1.336e+03 3.105e+03, threshold=1.990e+03, percent-clipped=13.0 2023-06-27 11:10:28,248 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1793052.0, ans=0.0 2023-06-27 11:10:32,374 INFO [train.py:996] (0/4) Epoch 10, batch 24400, loss[loss=0.2126, simple_loss=0.2954, pruned_loss=0.06486, over 21710.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2976, pruned_loss=0.06829, over 4282955.74 frames. 
], batch size: 351, lr: 2.89e-03, grad_scale: 32.0 2023-06-27 11:11:07,036 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1793172.0, ans=0.125 2023-06-27 11:11:58,317 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1793352.0, ans=0.09899494936611666 2023-06-27 11:12:10,057 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1793352.0, ans=0.125 2023-06-27 11:12:14,416 INFO [train.py:996] (0/4) Epoch 10, batch 24450, loss[loss=0.1992, simple_loss=0.2856, pruned_loss=0.05642, over 21273.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.298, pruned_loss=0.06953, over 4278452.97 frames. ], batch size: 176, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:13:01,971 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.56 vs. limit=6.0 2023-06-27 11:13:18,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1793532.0, ans=0.125 2023-06-27 11:13:25,434 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1793592.0, ans=0.125 2023-06-27 11:13:35,294 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1793592.0, ans=0.025 2023-06-27 11:13:50,877 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.345e+02 6.636e+02 9.241e+02 1.233e+03 3.193e+03, threshold=1.848e+03, percent-clipped=3.0 2023-06-27 11:13:54,212 INFO [train.py:996] (0/4) Epoch 10, batch 24500, loss[loss=0.207, simple_loss=0.2936, pruned_loss=0.06022, over 21267.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2985, pruned_loss=0.06932, over 4279547.88 frames. ], batch size: 176, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:14:08,880 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1793712.0, ans=0.05 2023-06-27 11:14:19,003 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1793772.0, ans=0.125 2023-06-27 11:14:20,614 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1793772.0, ans=0.0 2023-06-27 11:14:44,677 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1793832.0, ans=0.125 2023-06-27 11:15:40,101 INFO [train.py:996] (0/4) Epoch 10, batch 24550, loss[loss=0.2337, simple_loss=0.321, pruned_loss=0.07316, over 21600.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.3012, pruned_loss=0.0708, over 4275908.82 frames. ], batch size: 389, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:15:53,776 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.61 vs. limit=6.0 2023-06-27 11:16:16,015 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.92 vs. 
limit=15.0 2023-06-27 11:16:33,779 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1794132.0, ans=0.125 2023-06-27 11:16:43,793 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1794192.0, ans=0.0 2023-06-27 11:17:16,637 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.256e+02 6.446e+02 9.214e+02 1.322e+03 3.260e+03, threshold=1.843e+03, percent-clipped=13.0 2023-06-27 11:17:19,811 INFO [train.py:996] (0/4) Epoch 10, batch 24600, loss[loss=0.2299, simple_loss=0.315, pruned_loss=0.07237, over 21493.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2993, pruned_loss=0.07079, over 4268094.95 frames. ], batch size: 211, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:17:57,562 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1794372.0, ans=0.0 2023-06-27 11:18:09,587 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.45 vs. limit=22.5 2023-06-27 11:18:13,071 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.76 vs. limit=5.0 2023-06-27 11:18:20,977 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 11:18:25,857 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1794492.0, ans=0.1 2023-06-27 11:19:06,362 INFO [train.py:996] (0/4) Epoch 10, batch 24650, loss[loss=0.1922, simple_loss=0.2631, pruned_loss=0.06065, over 21309.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2919, pruned_loss=0.06976, over 4273018.55 frames. ], batch size: 131, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:19:27,487 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.65 vs. limit=15.0 2023-06-27 11:19:41,547 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1794672.0, ans=0.0 2023-06-27 11:20:21,586 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1794852.0, ans=0.0 2023-06-27 11:20:38,902 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.116e+02 6.411e+02 8.563e+02 1.154e+03 3.780e+03, threshold=1.713e+03, percent-clipped=12.0 2023-06-27 11:20:42,279 INFO [train.py:996] (0/4) Epoch 10, batch 24700, loss[loss=0.2037, simple_loss=0.2815, pruned_loss=0.06297, over 21789.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2889, pruned_loss=0.06854, over 4275231.63 frames. ], batch size: 317, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:22:16,081 INFO [train.py:996] (0/4) Epoch 10, batch 24750, loss[loss=0.1648, simple_loss=0.2317, pruned_loss=0.04895, over 21609.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2817, pruned_loss=0.06642, over 4270653.99 frames. 
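The [optim.py:471] entries above report quartiles of recently observed gradient norms together with a derived clipping threshold and the fraction of recent steps that were clipped (e.g. "grad-norm quartiles 4.116e+02 6.411e+02 8.563e+02 1.154e+03 3.780e+03, threshold=1.713e+03, percent-clipped=12.0"). Below is a minimal sketch of how statistics of that shape could be produced, assuming a sliding window of global grad norms and a threshold scaled off the median; the window size, the use of the median, and all names are illustrative assumptions, not the actual optim.py logic.

    import torch

    def clip_with_recent_quartiles(params, recent_norms, clipping_scale=2.0, window=128):
        """Illustrative sketch: derive a clipping threshold from the median of
        recently observed global grad norms, then clip this step against it."""
        grads = [p.grad for p in params if p.grad is not None]
        norm = torch.norm(torch.stack([g.detach().norm() for g in grads])).item()
        recent_norms.append(norm)
        del recent_norms[:-window]                     # keep a sliding window of recent norms
        t = torch.tensor(recent_norms)
        quartiles = [torch.quantile(t, p).item() for p in (0.0, 0.25, 0.5, 0.75, 1.0)]
        threshold = clipping_scale * quartiles[2]      # cf. "Clipping_scale=2.0 ... threshold=..."
        if norm > threshold > 0:
            for g in grads:
                g.mul_(threshold / norm)               # scale grads down to the threshold
        return quartiles, threshold, norm > threshold

Under this reading, "percent-clipped" would simply be the percentage of recent steps for which the last return value was True.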
], batch size: 263, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:22:52,485 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 11:23:14,959 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1795392.0, ans=0.05 2023-06-27 11:23:31,248 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1795452.0, ans=0.125 2023-06-27 11:23:37,692 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1795452.0, ans=0.1 2023-06-27 11:23:38,627 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.117e+02 6.853e+02 9.586e+02 1.478e+03 3.032e+03, threshold=1.917e+03, percent-clipped=13.0 2023-06-27 11:23:45,814 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1795512.0, ans=0.0 2023-06-27 11:23:46,726 INFO [train.py:996] (0/4) Epoch 10, batch 24800, loss[loss=0.1978, simple_loss=0.2674, pruned_loss=0.06414, over 21761.00 frames. ], tot_loss[loss=0.2036, simple_loss=0.276, pruned_loss=0.06562, over 4279147.30 frames. ], batch size: 247, lr: 2.89e-03, grad_scale: 32.0 2023-06-27 11:23:57,274 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1795512.0, ans=0.125 2023-06-27 11:24:12,496 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1795572.0, ans=0.2 2023-06-27 11:25:29,260 INFO [train.py:996] (0/4) Epoch 10, batch 24850, loss[loss=0.2001, simple_loss=0.2773, pruned_loss=0.06142, over 21857.00 frames. ], tot_loss[loss=0.2052, simple_loss=0.2765, pruned_loss=0.06689, over 4281190.07 frames. ], batch size: 298, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:26:24,947 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1795992.0, ans=0.125 2023-06-27 11:26:59,061 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.781e+02 6.964e+02 9.660e+02 1.513e+03 3.423e+03, threshold=1.932e+03, percent-clipped=14.0 2023-06-27 11:27:00,594 INFO [train.py:996] (0/4) Epoch 10, batch 24900, loss[loss=0.2358, simple_loss=0.3164, pruned_loss=0.07763, over 21699.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2794, pruned_loss=0.0678, over 4285743.75 frames. ], batch size: 351, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:27:51,183 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.74 vs. limit=15.0 2023-06-27 11:27:58,946 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.55 vs. limit=15.0 2023-06-27 11:28:41,464 INFO [train.py:996] (0/4) Epoch 10, batch 24950, loss[loss=0.2301, simple_loss=0.3071, pruned_loss=0.0766, over 21672.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2878, pruned_loss=0.07129, over 4289608.05 frames. ], batch size: 263, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:28:50,898 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.82 vs. 
limit=15.0 2023-06-27 11:29:08,578 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1796472.0, ans=0.125 2023-06-27 11:29:16,585 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1796472.0, ans=0.1 2023-06-27 11:29:16,668 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=1796472.0, ans=0.02 2023-06-27 11:29:29,623 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1796532.0, ans=0.0 2023-06-27 11:29:59,854 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.53 vs. limit=15.0 2023-06-27 11:30:12,726 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=1796652.0, ans=15.0 2023-06-27 11:30:19,516 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.654e+02 6.926e+02 9.542e+02 1.348e+03 3.788e+03, threshold=1.908e+03, percent-clipped=7.0 2023-06-27 11:30:20,989 INFO [train.py:996] (0/4) Epoch 10, batch 25000, loss[loss=0.189, simple_loss=0.2646, pruned_loss=0.0567, over 21479.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.293, pruned_loss=0.07256, over 4284112.01 frames. ], batch size: 194, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:30:28,433 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.12 vs. limit=15.0 2023-06-27 11:30:36,149 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1796712.0, ans=0.2 2023-06-27 11:30:37,873 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1796712.0, ans=0.0 2023-06-27 11:30:43,376 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.71 vs. limit=15.0 2023-06-27 11:31:40,063 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1796892.0, ans=0.125 2023-06-27 11:32:05,931 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1796952.0, ans=0.1 2023-06-27 11:32:11,878 INFO [train.py:996] (0/4) Epoch 10, batch 25050, loss[loss=0.2023, simple_loss=0.2706, pruned_loss=0.06702, over 21796.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.287, pruned_loss=0.07101, over 4280759.32 frames. ], batch size: 352, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:32:49,432 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1797132.0, ans=0.0 2023-06-27 11:33:20,632 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.68 vs. limit=8.0 2023-06-27 11:33:50,161 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.906e+02 5.494e+02 7.889e+02 1.087e+03 2.340e+03, threshold=1.578e+03, percent-clipped=4.0 2023-06-27 11:33:51,533 INFO [train.py:996] (0/4) Epoch 10, batch 25100, loss[loss=0.1978, simple_loss=0.2896, pruned_loss=0.05305, over 21753.00 frames. 
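The many ScheduledFloat entries above print a hyperparameter value (the "ans" field, e.g. a dropout_p of 0.1 or a balancer prob of 0.125) as a function of the current batch_count. Such schedules are typically piecewise-linear in batch count; the sketch below shows the general idea with made-up breakpoints and is not the exact schedule used for any of the named parameters.

    def scheduled_float(batch_count, points):
        """Piecewise-linear schedule over batch_count (illustrative sketch).

        `points` is a list of (batch_count, value) pairs; outside the covered
        range the endpoint values are held constant.
        """
        points = sorted(points)
        if batch_count <= points[0][0]:
            return points[0][1]
        if batch_count >= points[-1][0]:
            return points[-1][1]
        for (x0, y0), (x1, y1) in zip(points, points[1:]):
            if x0 <= batch_count <= x1:
                t = (batch_count - x0) / (x1 - x0)
                return y0 + t * (y1 - y0)

    # e.g. scheduled_float(1_789_092.0, [(0.0, 0.3), (20_000.0, 0.125)]) -> 0.125,
    # matching the flat "ans=0.125" values seen this late in training.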
], tot_loss[loss=0.2102, simple_loss=0.2818, pruned_loss=0.06932, over 4283241.27 frames. ], batch size: 282, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:34:03,404 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1797312.0, ans=0.125 2023-06-27 11:34:21,540 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.67 vs. limit=10.0 2023-06-27 11:34:44,860 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1797492.0, ans=0.0 2023-06-27 11:34:56,264 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1797492.0, ans=0.1 2023-06-27 11:35:14,034 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1797552.0, ans=0.0 2023-06-27 11:35:15,559 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1797552.0, ans=0.125 2023-06-27 11:35:26,562 INFO [train.py:996] (0/4) Epoch 10, batch 25150, loss[loss=0.2039, simple_loss=0.2928, pruned_loss=0.05746, over 21361.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2844, pruned_loss=0.06738, over 4279294.87 frames. ], batch size: 211, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:37:04,859 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.938e+02 6.755e+02 1.265e+03 1.654e+03 3.292e+03, threshold=2.530e+03, percent-clipped=31.0 2023-06-27 11:37:06,427 INFO [train.py:996] (0/4) Epoch 10, batch 25200, loss[loss=0.1753, simple_loss=0.2638, pruned_loss=0.04334, over 17029.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2846, pruned_loss=0.06524, over 4269804.04 frames. ], batch size: 65, lr: 2.89e-03, grad_scale: 32.0 2023-06-27 11:38:03,052 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=13.05 vs. limit=15.0 2023-06-27 11:38:19,964 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1798092.0, ans=0.2 2023-06-27 11:38:40,333 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1798152.0, ans=0.09899494936611666 2023-06-27 11:38:46,163 INFO [train.py:996] (0/4) Epoch 10, batch 25250, loss[loss=0.1843, simple_loss=0.2765, pruned_loss=0.04602, over 21656.00 frames. ], tot_loss[loss=0.2058, simple_loss=0.2833, pruned_loss=0.06418, over 4262982.91 frames. ], batch size: 263, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:39:14,135 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.02 vs. limit=15.0 2023-06-27 11:39:42,371 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1798332.0, ans=0.125 2023-06-27 11:39:44,606 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.90 vs. 
limit=15.0 2023-06-27 11:40:18,184 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1798452.0, ans=0.0 2023-06-27 11:40:32,592 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.222e+02 7.212e+02 1.026e+03 1.530e+03 2.488e+03, threshold=2.053e+03, percent-clipped=0.0 2023-06-27 11:40:32,622 INFO [train.py:996] (0/4) Epoch 10, batch 25300, loss[loss=0.1943, simple_loss=0.2599, pruned_loss=0.06442, over 21303.00 frames. ], tot_loss[loss=0.2036, simple_loss=0.2803, pruned_loss=0.06347, over 4252059.60 frames. ], batch size: 144, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:40:55,978 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1798572.0, ans=0.125 2023-06-27 11:41:12,954 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.58 vs. limit=15.0 2023-06-27 11:41:13,798 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1798632.0, ans=0.125 2023-06-27 11:41:27,572 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=6.71 vs. limit=22.5 2023-06-27 11:42:13,847 INFO [train.py:996] (0/4) Epoch 10, batch 25350, loss[loss=0.1772, simple_loss=0.273, pruned_loss=0.04065, over 20793.00 frames. ], tot_loss[loss=0.2039, simple_loss=0.282, pruned_loss=0.06295, over 4250871.54 frames. ], batch size: 608, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:42:17,598 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1798812.0, ans=0.125 2023-06-27 11:42:50,028 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1798932.0, ans=0.0 2023-06-27 11:43:53,123 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.945e+02 6.520e+02 8.858e+02 1.308e+03 2.699e+03, threshold=1.772e+03, percent-clipped=4.0 2023-06-27 11:43:53,153 INFO [train.py:996] (0/4) Epoch 10, batch 25400, loss[loss=0.2089, simple_loss=0.2839, pruned_loss=0.06697, over 21836.00 frames. ], tot_loss[loss=0.2022, simple_loss=0.2792, pruned_loss=0.06259, over 4256750.98 frames. ], batch size: 107, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:44:08,848 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1799172.0, ans=0.95 2023-06-27 11:44:12,020 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1799172.0, ans=0.125 2023-06-27 11:44:38,495 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.48 vs. 
limit=15.0 2023-06-27 11:44:58,823 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1799292.0, ans=0.125 2023-06-27 11:45:00,548 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1799292.0, ans=0.5 2023-06-27 11:45:00,589 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1799292.0, ans=0.125 2023-06-27 11:45:21,824 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1799352.0, ans=0.1 2023-06-27 11:45:24,191 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.97 vs. limit=12.0 2023-06-27 11:45:34,081 INFO [train.py:996] (0/4) Epoch 10, batch 25450, loss[loss=0.2068, simple_loss=0.2757, pruned_loss=0.06894, over 20048.00 frames. ], tot_loss[loss=0.2034, simple_loss=0.2795, pruned_loss=0.06365, over 4257832.65 frames. ], batch size: 702, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 11:45:38,352 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1799412.0, ans=0.125 2023-06-27 11:45:44,763 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1799412.0, ans=0.125 2023-06-27 11:45:49,748 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1799472.0, ans=0.125 2023-06-27 11:46:45,469 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1799592.0, ans=0.0 2023-06-27 11:47:16,317 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.168e+02 6.039e+02 8.121e+02 1.135e+03 2.521e+03, threshold=1.624e+03, percent-clipped=2.0 2023-06-27 11:47:16,356 INFO [train.py:996] (0/4) Epoch 10, batch 25500, loss[loss=0.1778, simple_loss=0.26, pruned_loss=0.04779, over 21268.00 frames. ], tot_loss[loss=0.2017, simple_loss=0.2799, pruned_loss=0.06179, over 4247679.26 frames. ], batch size: 176, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 11:48:43,060 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.47 vs. limit=15.0 2023-06-27 11:48:52,300 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-300000.pt 2023-06-27 11:48:58,446 INFO [train.py:996] (0/4) Epoch 10, batch 25550, loss[loss=0.2121, simple_loss=0.2883, pruned_loss=0.06793, over 15573.00 frames. ], tot_loss[loss=0.2049, simple_loss=0.2863, pruned_loss=0.06173, over 4231744.71 frames. ], batch size: 60, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 11:49:04,117 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1800012.0, ans=0.125 2023-06-27 11:49:08,887 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1800012.0, ans=0.125 2023-06-27 11:50:03,070 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.20 vs. 
limit=15.0 2023-06-27 11:50:29,882 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1800252.0, ans=0.2 2023-06-27 11:50:31,758 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1800252.0, ans=0.125 2023-06-27 11:50:38,210 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1800312.0, ans=0.1 2023-06-27 11:50:39,239 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.324e+02 5.995e+02 1.017e+03 1.623e+03 5.096e+03, threshold=2.035e+03, percent-clipped=24.0 2023-06-27 11:50:39,270 INFO [train.py:996] (0/4) Epoch 10, batch 25600, loss[loss=0.2385, simple_loss=0.3096, pruned_loss=0.08365, over 21844.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.2907, pruned_loss=0.06302, over 4237868.82 frames. ], batch size: 247, lr: 2.88e-03, grad_scale: 32.0 2023-06-27 11:51:08,968 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1800372.0, ans=0.1 2023-06-27 11:52:19,297 INFO [train.py:996] (0/4) Epoch 10, batch 25650, loss[loss=0.2189, simple_loss=0.2842, pruned_loss=0.07685, over 21839.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.291, pruned_loss=0.06559, over 4246492.20 frames. ], batch size: 98, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 11:52:36,146 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1800612.0, ans=0.0 2023-06-27 11:53:42,913 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1800852.0, ans=0.125 2023-06-27 11:54:00,583 INFO [train.py:996] (0/4) Epoch 10, batch 25700, loss[loss=0.2745, simple_loss=0.3336, pruned_loss=0.1077, over 21651.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2872, pruned_loss=0.0664, over 4255645.43 frames. ], batch size: 508, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 11:54:06,859 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.076e+02 8.398e+02 1.386e+03 2.056e+03 4.305e+03, threshold=2.773e+03, percent-clipped=25.0 2023-06-27 11:54:12,477 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=1800912.0, ans=10.0 2023-06-27 11:54:48,040 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1801032.0, ans=0.1 2023-06-27 11:54:49,592 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1801032.0, ans=0.125 2023-06-27 11:55:00,655 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1801032.0, ans=0.125 2023-06-27 11:55:03,231 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.82 vs. limit=15.0 2023-06-27 11:55:12,925 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.09 vs. limit=15.0 2023-06-27 11:55:46,717 INFO [train.py:996] (0/4) Epoch 10, batch 25750, loss[loss=0.2637, simple_loss=0.3691, pruned_loss=0.07917, over 20764.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2922, pruned_loss=0.06898, over 4251887.98 frames. 
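The "Saving checkpoint to zipformer/exp_L_small/checkpoint-300000.pt" entry above is a periodic, batch-indexed save. A rough sketch of that pattern follows; the save interval and the exact contents of the saved dict are assumptions for illustration, not the training script's code.

    from pathlib import Path
    import torch

    def maybe_save_checkpoint(model, optimizer, scheduler, scaler,
                              batch_idx_train, exp_dir, save_every_n=4000):
        """Save a batch-indexed checkpoint every `save_every_n` batches (sketch)."""
        if batch_idx_train % save_every_n != 0:
            return None
        path = Path(exp_dir) / f"checkpoint-{batch_idx_train}.pt"
        torch.save(
            {
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "scheduler": scheduler.state_dict(),
                "grad_scaler": scaler.state_dict(),
                "batch_idx_train": batch_idx_train,
            },
            path,
        )
        return path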
], batch size: 608, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 11:55:54,260 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.23 vs. limit=6.0 2023-06-27 11:56:53,865 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.68 vs. limit=22.5 2023-06-27 11:57:00,758 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.76 vs. limit=10.0 2023-06-27 11:57:39,563 INFO [train.py:996] (0/4) Epoch 10, batch 25800, loss[loss=0.2179, simple_loss=0.2973, pruned_loss=0.0693, over 21812.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.3029, pruned_loss=0.07243, over 4262712.57 frames. ], batch size: 282, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 11:57:41,427 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.492e+02 7.009e+02 1.091e+03 1.518e+03 3.688e+03, threshold=2.182e+03, percent-clipped=4.0 2023-06-27 11:57:47,633 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.89 vs. limit=15.0 2023-06-27 11:58:44,432 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1801692.0, ans=0.05 2023-06-27 11:59:00,371 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1801692.0, ans=0.0 2023-06-27 11:59:23,906 INFO [train.py:996] (0/4) Epoch 10, batch 25850, loss[loss=0.2237, simple_loss=0.2927, pruned_loss=0.07734, over 21851.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.3073, pruned_loss=0.07358, over 4260741.79 frames. ], batch size: 118, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 11:59:44,042 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1801872.0, ans=0.0 2023-06-27 12:00:06,108 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1801932.0, ans=0.125 2023-06-27 12:00:15,871 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1801932.0, ans=0.125 2023-06-27 12:01:03,668 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1802052.0, ans=0.0 2023-06-27 12:01:10,659 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1802112.0, ans=0.125 2023-06-27 12:01:11,684 INFO [train.py:996] (0/4) Epoch 10, batch 25900, loss[loss=0.2476, simple_loss=0.3438, pruned_loss=0.0757, over 21667.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.308, pruned_loss=0.07369, over 4268658.79 frames. 
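The grad_scale values that alternate between 16.0 and 32.0 in the batch entries are the dynamic loss scale used for fp16 training: the scaler halves the scale when a step overflows and grows it again after a run of successful steps. This is standard torch.cuda.amp behaviour; the loop below is a generic sketch of where such a value comes from, with compute_loss as a placeholder, not a copy of the training script.

    import torch

    scaler = torch.cuda.amp.GradScaler()       # its scale is what the log reports as "grad_scale"

    def training_step(model, optimizer, batch, compute_loss):
        """Generic mixed-precision step (illustrative)."""
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss = compute_loss(model, batch)
        scaler.scale(loss).backward()
        scaler.step(optimizer)                 # skipped internally if inf/nan grads were found
        scaler.update()                        # grows or halves the scale, e.g. 16.0 <-> 32.0
        return loss.detach(), scaler.get_scale()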
], batch size: 263, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:01:12,365 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1802112.0, ans=0.125 2023-06-27 12:01:13,359 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.315e+02 6.367e+02 8.577e+02 1.335e+03 4.211e+03, threshold=1.715e+03, percent-clipped=7.0 2023-06-27 12:01:23,809 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1802112.0, ans=0.125 2023-06-27 12:01:35,559 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1802172.0, ans=0.125 2023-06-27 12:02:37,945 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.46 vs. limit=15.0 2023-06-27 12:02:53,656 INFO [train.py:996] (0/4) Epoch 10, batch 25950, loss[loss=0.2618, simple_loss=0.3359, pruned_loss=0.0938, over 21588.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3134, pruned_loss=0.07589, over 4273443.16 frames. ], batch size: 389, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:02:59,243 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1802412.0, ans=0.125 2023-06-27 12:03:49,390 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1802532.0, ans=0.0 2023-06-27 12:03:54,279 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1802532.0, ans=0.2 2023-06-27 12:04:04,330 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1802592.0, ans=0.2 2023-06-27 12:04:16,290 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1802592.0, ans=0.125 2023-06-27 12:04:21,182 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1802652.0, ans=0.125 2023-06-27 12:04:28,036 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1802652.0, ans=0.0 2023-06-27 12:04:35,378 INFO [train.py:996] (0/4) Epoch 10, batch 26000, loss[loss=0.2357, simple_loss=0.3264, pruned_loss=0.07244, over 21624.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3126, pruned_loss=0.07405, over 4277476.95 frames. ], batch size: 389, lr: 2.88e-03, grad_scale: 32.0 2023-06-27 12:04:37,160 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.008e+02 6.185e+02 7.875e+02 1.125e+03 3.104e+03, threshold=1.575e+03, percent-clipped=8.0 2023-06-27 12:04:39,284 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1802712.0, ans=0.125 2023-06-27 12:04:44,369 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1802712.0, ans=0.125 2023-06-27 12:05:22,145 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.16 vs. 
limit=6.0 2023-06-27 12:05:34,683 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1802832.0, ans=0.125 2023-06-27 12:05:34,720 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1802832.0, ans=0.125 2023-06-27 12:06:02,493 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1802952.0, ans=0.2 2023-06-27 12:06:11,935 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1802952.0, ans=0.125 2023-06-27 12:06:16,103 INFO [train.py:996] (0/4) Epoch 10, batch 26050, loss[loss=0.2106, simple_loss=0.2787, pruned_loss=0.07122, over 21660.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3122, pruned_loss=0.07431, over 4273252.49 frames. ], batch size: 263, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:06:28,066 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1803012.0, ans=0.0 2023-06-27 12:07:50,607 INFO [train.py:996] (0/4) Epoch 10, batch 26100, loss[loss=0.2035, simple_loss=0.2696, pruned_loss=0.06872, over 21452.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3074, pruned_loss=0.07472, over 4283858.92 frames. ], batch size: 211, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:07:53,722 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.444e+02 6.064e+02 8.418e+02 1.151e+03 2.910e+03, threshold=1.684e+03, percent-clipped=10.0 2023-06-27 12:08:19,761 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1803372.0, ans=0.0 2023-06-27 12:08:53,679 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1803492.0, ans=0.0 2023-06-27 12:09:03,409 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1803492.0, ans=0.1 2023-06-27 12:09:30,845 INFO [train.py:996] (0/4) Epoch 10, batch 26150, loss[loss=0.2415, simple_loss=0.3212, pruned_loss=0.08089, over 21588.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3042, pruned_loss=0.07428, over 4290145.44 frames. ], batch size: 414, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:10:15,509 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1803672.0, ans=0.0 2023-06-27 12:10:16,266 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.22 vs. limit=22.5 2023-06-27 12:10:17,990 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.75 vs. limit=15.0 2023-06-27 12:10:43,651 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.53 vs. 
limit=15.0 2023-06-27 12:10:44,663 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1803792.0, ans=0.0 2023-06-27 12:10:59,806 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1803852.0, ans=0.1 2023-06-27 12:11:16,875 INFO [train.py:996] (0/4) Epoch 10, batch 26200, loss[loss=0.2367, simple_loss=0.3462, pruned_loss=0.0636, over 21256.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.3045, pruned_loss=0.07241, over 4286643.16 frames. ], batch size: 548, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:11:20,493 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.988e+02 7.097e+02 1.092e+03 1.637e+03 2.606e+03, threshold=2.184e+03, percent-clipped=21.0 2023-06-27 12:12:30,461 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1804152.0, ans=0.125 2023-06-27 12:12:32,313 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1804152.0, ans=0.0 2023-06-27 12:12:52,756 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1804152.0, ans=0.09899494936611666 2023-06-27 12:12:54,813 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.81 vs. limit=15.0 2023-06-27 12:12:56,972 INFO [train.py:996] (0/4) Epoch 10, batch 26250, loss[loss=0.2223, simple_loss=0.3013, pruned_loss=0.0717, over 21606.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.307, pruned_loss=0.07096, over 4291812.51 frames. ], batch size: 548, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:13:21,250 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1804272.0, ans=0.0 2023-06-27 12:14:03,325 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1804392.0, ans=0.125 2023-06-27 12:14:04,988 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1804392.0, ans=0.0 2023-06-27 12:14:36,333 INFO [train.py:996] (0/4) Epoch 10, batch 26300, loss[loss=0.2071, simple_loss=0.2809, pruned_loss=0.06668, over 21925.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.3036, pruned_loss=0.07154, over 4292091.08 frames. 
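Each [scaling.py:962] Whitening entry compares a per-module statistic of the activations ("metric") against a limit; values well under the limit suggest the activations are already close to white (isotropic) within their channel groups. The formula below is one plausible anisotropy metric of that kind, equal to 1.0 for perfectly whitened features and growing with covariance spread; it is an illustrative reading of the log, not necessarily the exact computation in scaling.py.

    import torch

    def whitening_metric(x, num_groups=1):
        """Anisotropy of activations x of shape (N, C), averaged over channel groups.

        Per group: C_g * sum(lambda_i^2) / (sum(lambda_i))^2 over the
        eigenvalues lambda_i of the group's covariance matrix.
        """
        n, _ = x.shape
        metrics = []
        for g in x.chunk(num_groups, dim=1):
            g = g - g.mean(dim=0, keepdim=True)
            cov = (g.t() @ g) / n
            eigs = torch.linalg.eigvalsh(cov)
            metrics.append(g.shape[1] * (eigs ** 2).sum() / (eigs.sum() ** 2).clamp(min=1e-20))
        return torch.stack(metrics).mean()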
], batch size: 351, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:14:39,661 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.217e+02 5.994e+02 7.746e+02 1.132e+03 2.553e+03, threshold=1.549e+03, percent-clipped=2.0 2023-06-27 12:14:41,894 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1804512.0, ans=0.0 2023-06-27 12:14:53,102 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1804512.0, ans=0.0 2023-06-27 12:14:53,191 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1804512.0, ans=0.0 2023-06-27 12:15:01,188 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1804572.0, ans=0.125 2023-06-27 12:15:25,626 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1804632.0, ans=0.0 2023-06-27 12:15:32,316 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1804632.0, ans=0.125 2023-06-27 12:15:38,939 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1804692.0, ans=0.1 2023-06-27 12:15:58,481 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1804752.0, ans=0.1 2023-06-27 12:16:15,734 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1804812.0, ans=0.125 2023-06-27 12:16:16,767 INFO [train.py:996] (0/4) Epoch 10, batch 26350, loss[loss=0.2466, simple_loss=0.3222, pruned_loss=0.08554, over 21680.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.3016, pruned_loss=0.07168, over 4291522.76 frames. ], batch size: 351, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:16:19,683 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.11 vs. limit=15.0 2023-06-27 12:16:29,109 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1804812.0, ans=0.0 2023-06-27 12:16:30,577 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1804812.0, ans=0.2 2023-06-27 12:17:04,320 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1804932.0, ans=0.125 2023-06-27 12:17:51,179 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1805112.0, ans=0.125 2023-06-27 12:17:51,989 INFO [train.py:996] (0/4) Epoch 10, batch 26400, loss[loss=0.1918, simple_loss=0.2547, pruned_loss=0.06446, over 21578.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2966, pruned_loss=0.07198, over 4281697.03 frames. 
], batch size: 231, lr: 2.88e-03, grad_scale: 32.0 2023-06-27 12:17:55,527 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.612e+02 7.254e+02 1.118e+03 1.690e+03 3.507e+03, threshold=2.236e+03, percent-clipped=29.0 2023-06-27 12:18:33,660 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1805232.0, ans=0.2 2023-06-27 12:18:42,637 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1805232.0, ans=0.1 2023-06-27 12:19:16,450 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1805292.0, ans=0.125 2023-06-27 12:19:23,237 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1805352.0, ans=0.1 2023-06-27 12:19:36,211 INFO [train.py:996] (0/4) Epoch 10, batch 26450, loss[loss=0.2347, simple_loss=0.3262, pruned_loss=0.0716, over 21685.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2958, pruned_loss=0.07164, over 4276049.77 frames. ], batch size: 247, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:19:40,851 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.26 vs. limit=12.0 2023-06-27 12:19:43,581 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1805412.0, ans=0.125 2023-06-27 12:19:46,096 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.22 vs. limit=15.0 2023-06-27 12:19:47,337 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1805412.0, ans=0.125 2023-06-27 12:19:58,633 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1805472.0, ans=0.1 2023-06-27 12:20:03,539 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1805472.0, ans=0.04949747468305833 2023-06-27 12:20:14,252 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.23 vs. limit=10.0 2023-06-27 12:20:23,362 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1805532.0, ans=0.1 2023-06-27 12:20:33,103 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.76 vs. limit=12.0 2023-06-27 12:20:41,745 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.45 vs. limit=10.0 2023-06-27 12:21:04,988 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.08 vs. 
limit=6.0 2023-06-27 12:21:09,325 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1805652.0, ans=0.125 2023-06-27 12:21:09,438 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1805652.0, ans=0.0 2023-06-27 12:21:19,551 INFO [train.py:996] (0/4) Epoch 10, batch 26500, loss[loss=0.2578, simple_loss=0.3413, pruned_loss=0.08712, over 21681.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2984, pruned_loss=0.07113, over 4267257.80 frames. ], batch size: 441, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:21:28,834 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.787e+02 8.473e+02 1.317e+03 2.228e+03 4.940e+03, threshold=2.635e+03, percent-clipped=24.0 2023-06-27 12:21:40,876 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.19 vs. limit=15.0 2023-06-27 12:22:36,815 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1805892.0, ans=0.125 2023-06-27 12:23:07,846 INFO [train.py:996] (0/4) Epoch 10, batch 26550, loss[loss=0.1651, simple_loss=0.2476, pruned_loss=0.04134, over 21558.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2959, pruned_loss=0.06847, over 4267414.51 frames. ], batch size: 212, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:23:20,150 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1806012.0, ans=0.2 2023-06-27 12:24:28,528 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1806192.0, ans=0.1 2023-06-27 12:24:53,640 INFO [train.py:996] (0/4) Epoch 10, batch 26600, loss[loss=0.1925, simple_loss=0.2768, pruned_loss=0.05407, over 21280.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2959, pruned_loss=0.06574, over 4271624.34 frames. ], batch size: 176, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:25:02,997 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.900e+02 9.676e+02 1.340e+03 1.727e+03 3.782e+03, threshold=2.679e+03, percent-clipped=7.0 2023-06-27 12:26:38,849 INFO [train.py:996] (0/4) Epoch 10, batch 26650, loss[loss=0.167, simple_loss=0.2409, pruned_loss=0.04657, over 21145.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.2891, pruned_loss=0.06497, over 4270875.92 frames. ], batch size: 159, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:26:55,313 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1806612.0, ans=0.125 2023-06-27 12:26:57,395 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.48 vs. 
limit=15.0 2023-06-27 12:27:08,372 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1806672.0, ans=0.1 2023-06-27 12:27:25,949 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1806732.0, ans=0.125 2023-06-27 12:27:29,156 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1806732.0, ans=0.125 2023-06-27 12:27:46,500 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1806792.0, ans=0.0 2023-06-27 12:27:56,620 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1806852.0, ans=0.0 2023-06-27 12:28:18,264 INFO [train.py:996] (0/4) Epoch 10, batch 26700, loss[loss=0.2189, simple_loss=0.2917, pruned_loss=0.07304, over 21877.00 frames. ], tot_loss[loss=0.2044, simple_loss=0.2832, pruned_loss=0.06282, over 4271337.29 frames. ], batch size: 351, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:28:23,315 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.270e+02 4.740e+02 5.974e+02 7.751e+02 2.095e+03, threshold=1.195e+03, percent-clipped=0.0 2023-06-27 12:28:46,567 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1806972.0, ans=0.1 2023-06-27 12:29:03,351 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.47 vs. limit=15.0 2023-06-27 12:29:59,502 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1807152.0, ans=0.1 2023-06-27 12:30:03,644 INFO [train.py:996] (0/4) Epoch 10, batch 26750, loss[loss=0.2347, simple_loss=0.3173, pruned_loss=0.07599, over 21359.00 frames. ], tot_loss[loss=0.2035, simple_loss=0.2826, pruned_loss=0.06215, over 4274158.66 frames. ], batch size: 548, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:30:19,638 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.63 vs. limit=15.0 2023-06-27 12:31:45,852 INFO [train.py:996] (0/4) Epoch 10, batch 26800, loss[loss=0.2791, simple_loss=0.3378, pruned_loss=0.1103, over 21415.00 frames. ], tot_loss[loss=0.212, simple_loss=0.2914, pruned_loss=0.0663, over 4273535.37 frames. 
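In every [train.py:996] batch entry, the first loss[...] tuple describes the current batch and tot_loss[...] is a frame-weighted running average over recent batches, which is why its "over N frames" count sits in the millions. One simple way to maintain such a statistic is an exponentially decayed, frame-weighted sum, sketched below; the decay constant and class name are assumptions, not the tracker the script actually uses.

    class RunningLoss:
        """Frame-weighted running average with exponential forgetting (illustrative)."""

        def __init__(self, decay=0.999):
            self.decay = decay
            self.loss_sum = 0.0
            self.frames = 0.0

        def update(self, batch_loss_sum, batch_frames):
            self.loss_sum = self.decay * self.loss_sum + batch_loss_sum
            self.frames = self.decay * self.frames + batch_frames

        @property
        def value(self):
            return self.loss_sum / max(self.frames, 1.0)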
], batch size: 471, lr: 2.88e-03, grad_scale: 32.0 2023-06-27 12:31:51,192 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.485e+02 8.253e+02 1.353e+03 2.004e+03 3.922e+03, threshold=2.706e+03, percent-clipped=54.0 2023-06-27 12:31:51,741 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1807512.0, ans=0.0 2023-06-27 12:31:58,254 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1807512.0, ans=0.125 2023-06-27 12:32:48,254 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1807692.0, ans=0.0 2023-06-27 12:33:09,851 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1807752.0, ans=0.0 2023-06-27 12:33:10,417 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.43 vs. limit=22.5 2023-06-27 12:33:27,233 INFO [train.py:996] (0/4) Epoch 10, batch 26850, loss[loss=0.1973, simple_loss=0.267, pruned_loss=0.06377, over 21893.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2934, pruned_loss=0.06865, over 4271127.61 frames. ], batch size: 107, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:33:32,809 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1807812.0, ans=0.1 2023-06-27 12:34:46,097 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 12:35:04,339 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1808052.0, ans=0.1 2023-06-27 12:35:04,341 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1808052.0, ans=0.2 2023-06-27 12:35:07,056 INFO [train.py:996] (0/4) Epoch 10, batch 26900, loss[loss=0.202, simple_loss=0.2564, pruned_loss=0.07379, over 21506.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2851, pruned_loss=0.06751, over 4271140.74 frames. ], batch size: 442, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:35:13,696 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.943e+02 6.437e+02 8.362e+02 1.264e+03 2.899e+03, threshold=1.672e+03, percent-clipped=1.0 2023-06-27 12:36:30,736 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1808352.0, ans=0.0 2023-06-27 12:36:45,520 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1808412.0, ans=0.025 2023-06-27 12:36:46,477 INFO [train.py:996] (0/4) Epoch 10, batch 26950, loss[loss=0.2037, simple_loss=0.291, pruned_loss=0.05818, over 19723.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.2849, pruned_loss=0.06784, over 4273582.76 frames. ], batch size: 702, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:36:58,687 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 12:36:59,293 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.39 vs. 
limit=10.0 2023-06-27 12:37:45,140 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1808532.0, ans=0.125 2023-06-27 12:38:09,556 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1808652.0, ans=0.1 2023-06-27 12:38:27,748 INFO [train.py:996] (0/4) Epoch 10, batch 27000, loss[loss=0.2162, simple_loss=0.2773, pruned_loss=0.07758, over 20126.00 frames. ], tot_loss[loss=0.2085, simple_loss=0.2852, pruned_loss=0.06591, over 4276288.32 frames. ], batch size: 703, lr: 2.88e-03, grad_scale: 8.0 2023-06-27 12:38:27,750 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-27 12:38:47,561 INFO [train.py:1028] (0/4) Epoch 10, validation: loss=0.2474, simple_loss=0.3368, pruned_loss=0.07904, over 1796401.00 frames. 2023-06-27 12:38:47,562 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23714MB 2023-06-27 12:39:01,426 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.008e+02 5.840e+02 8.267e+02 1.216e+03 2.372e+03, threshold=1.653e+03, percent-clipped=7.0 2023-06-27 12:39:22,050 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1808772.0, ans=0.1 2023-06-27 12:39:45,122 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1808832.0, ans=0.2 2023-06-27 12:40:03,914 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.74 vs. limit=15.0 2023-06-27 12:40:14,633 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1808952.0, ans=0.125 2023-06-27 12:40:21,377 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.28 vs. limit=15.0 2023-06-27 12:40:29,861 INFO [train.py:996] (0/4) Epoch 10, batch 27050, loss[loss=0.2082, simple_loss=0.2866, pruned_loss=0.0649, over 21609.00 frames. ], tot_loss[loss=0.2062, simple_loss=0.2859, pruned_loss=0.06324, over 4277312.22 frames. ], batch size: 230, lr: 2.88e-03, grad_scale: 8.0 2023-06-27 12:40:51,139 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1809072.0, ans=0.0 2023-06-27 12:40:52,613 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1809072.0, ans=0.0 2023-06-27 12:40:52,659 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1809072.0, ans=0.0 2023-06-27 12:41:37,938 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1809192.0, ans=0.125 2023-06-27 12:42:10,040 INFO [train.py:996] (0/4) Epoch 10, batch 27100, loss[loss=0.2091, simple_loss=0.3064, pruned_loss=0.0559, over 21460.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.2878, pruned_loss=0.0644, over 4284131.89 frames. 
], batch size: 211, lr: 2.88e-03, grad_scale: 8.0 2023-06-27 12:42:22,704 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.015e+02 5.614e+02 8.365e+02 1.169e+03 2.643e+03, threshold=1.673e+03, percent-clipped=10.0 2023-06-27 12:42:25,022 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1809312.0, ans=0.125 2023-06-27 12:43:12,999 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1809492.0, ans=0.1 2023-06-27 12:43:21,149 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1809492.0, ans=0.2 2023-06-27 12:43:51,701 INFO [train.py:996] (0/4) Epoch 10, batch 27150, loss[loss=0.2148, simple_loss=0.3099, pruned_loss=0.05985, over 21674.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.299, pruned_loss=0.06722, over 4286589.04 frames. ], batch size: 263, lr: 2.88e-03, grad_scale: 8.0 2023-06-27 12:44:29,801 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1809672.0, ans=0.125 2023-06-27 12:44:29,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1809672.0, ans=0.125 2023-06-27 12:45:03,641 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff3.min_abs, batch_count=1809792.0, ans=0.2 2023-06-27 12:45:37,927 INFO [train.py:996] (0/4) Epoch 10, batch 27200, loss[loss=0.244, simple_loss=0.3135, pruned_loss=0.08721, over 21592.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.3065, pruned_loss=0.06982, over 4285201.04 frames. ], batch size: 230, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:45:39,305 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.05 vs. limit=15.0 2023-06-27 12:45:41,855 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1809912.0, ans=0.0 2023-06-27 12:45:50,768 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.524e+02 6.689e+02 1.006e+03 1.593e+03 2.972e+03, threshold=2.013e+03, percent-clipped=22.0 2023-06-27 12:45:56,546 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1809912.0, ans=0.2 2023-06-27 12:47:02,338 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.96 vs. limit=22.5 2023-06-27 12:47:06,493 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1810152.0, ans=0.035 2023-06-27 12:47:19,000 INFO [train.py:996] (0/4) Epoch 10, batch 27250, loss[loss=0.2221, simple_loss=0.2962, pruned_loss=0.07406, over 20610.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3096, pruned_loss=0.07358, over 4283839.12 frames. ], batch size: 607, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:47:49,883 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.75 vs. 
limit=12.0 2023-06-27 12:48:36,446 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1810392.0, ans=0.1 2023-06-27 12:48:40,872 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.65 vs. limit=10.0 2023-06-27 12:48:41,831 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1810452.0, ans=0.125 2023-06-27 12:48:58,234 INFO [train.py:996] (0/4) Epoch 10, batch 27300, loss[loss=0.2564, simple_loss=0.348, pruned_loss=0.08244, over 21534.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3111, pruned_loss=0.07439, over 4278152.65 frames. ], batch size: 471, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:49:06,658 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.182e+02 6.508e+02 9.291e+02 1.314e+03 3.410e+03, threshold=1.858e+03, percent-clipped=10.0 2023-06-27 12:49:10,641 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1810512.0, ans=0.125 2023-06-27 12:50:29,885 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.99 vs. limit=22.5 2023-06-27 12:50:38,159 INFO [train.py:996] (0/4) Epoch 10, batch 27350, loss[loss=0.2256, simple_loss=0.3078, pruned_loss=0.07169, over 21885.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3141, pruned_loss=0.07556, over 4279488.93 frames. ], batch size: 118, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:51:16,196 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.84 vs. limit=12.0 2023-06-27 12:51:24,995 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1810932.0, ans=0.1 2023-06-27 12:51:28,091 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1810932.0, ans=0.125 2023-06-27 12:51:28,097 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1810932.0, ans=0.2 2023-06-27 12:51:49,068 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1810992.0, ans=0.0 2023-06-27 12:52:08,533 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1811052.0, ans=0.09899494936611666 2023-06-27 12:52:12,673 INFO [train.py:996] (0/4) Epoch 10, batch 27400, loss[loss=0.1901, simple_loss=0.2595, pruned_loss=0.06037, over 21372.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3078, pruned_loss=0.07419, over 4287086.36 frames. 
], batch size: 177, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:52:20,978 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.204e+02 5.726e+02 8.020e+02 1.365e+03 2.836e+03, threshold=1.604e+03, percent-clipped=8.0 2023-06-27 12:52:28,572 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1811172.0, ans=0.1 2023-06-27 12:52:59,342 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1811232.0, ans=0.0 2023-06-27 12:53:20,514 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1811292.0, ans=0.0 2023-06-27 12:53:54,326 INFO [train.py:996] (0/4) Epoch 10, batch 27450, loss[loss=0.2317, simple_loss=0.3114, pruned_loss=0.07602, over 21300.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.3015, pruned_loss=0.07231, over 4281250.54 frames. ], batch size: 548, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:54:22,658 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1811472.0, ans=0.0 2023-06-27 12:54:48,113 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1811532.0, ans=0.125 2023-06-27 12:55:30,301 INFO [train.py:996] (0/4) Epoch 10, batch 27500, loss[loss=0.219, simple_loss=0.2908, pruned_loss=0.07361, over 21859.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.3006, pruned_loss=0.07282, over 4279736.33 frames. ], batch size: 371, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:55:38,230 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.956e+02 6.120e+02 9.251e+02 1.541e+03 3.924e+03, threshold=1.850e+03, percent-clipped=23.0 2023-06-27 12:55:56,885 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.07 vs. limit=15.0 2023-06-27 12:56:07,504 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1811772.0, ans=0.1 2023-06-27 12:56:29,719 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1811832.0, ans=0.1 2023-06-27 12:56:57,583 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.48 vs. limit=15.0 2023-06-27 12:57:03,275 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1811952.0, ans=0.125 2023-06-27 12:57:09,553 INFO [train.py:996] (0/4) Epoch 10, batch 27550, loss[loss=0.1804, simple_loss=0.2499, pruned_loss=0.05543, over 21245.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2961, pruned_loss=0.06963, over 4288114.35 frames. ], batch size: 548, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 12:57:17,066 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.28 vs. 
limit=15.0 2023-06-27 12:57:54,717 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1812132.0, ans=0.035 2023-06-27 12:57:59,680 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1812132.0, ans=0.125 2023-06-27 12:58:08,779 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1812132.0, ans=0.125 2023-06-27 12:58:23,187 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1812192.0, ans=0.0 2023-06-27 12:58:29,798 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1812192.0, ans=0.125 2023-06-27 12:58:41,079 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1812252.0, ans=0.2 2023-06-27 12:58:48,793 INFO [train.py:996] (0/4) Epoch 10, batch 27600, loss[loss=0.2203, simple_loss=0.3024, pruned_loss=0.06911, over 19960.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.289, pruned_loss=0.06794, over 4284939.29 frames. ], batch size: 702, lr: 2.87e-03, grad_scale: 32.0 2023-06-27 12:58:56,902 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.403e+02 6.402e+02 9.119e+02 1.240e+03 2.150e+03, threshold=1.824e+03, percent-clipped=4.0 2023-06-27 12:59:42,857 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1812432.0, ans=0.0 2023-06-27 13:00:29,608 INFO [train.py:996] (0/4) Epoch 10, batch 27650, loss[loss=0.2024, simple_loss=0.2901, pruned_loss=0.0574, over 21730.00 frames. ], tot_loss[loss=0.2091, simple_loss=0.2833, pruned_loss=0.06742, over 4277605.45 frames. ], batch size: 298, lr: 2.87e-03, grad_scale: 32.0 2023-06-27 13:01:00,684 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1812672.0, ans=0.125 2023-06-27 13:01:46,372 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1812792.0, ans=0.125 2023-06-27 13:02:10,577 INFO [train.py:996] (0/4) Epoch 10, batch 27700, loss[loss=0.1872, simple_loss=0.2711, pruned_loss=0.05163, over 21466.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.2844, pruned_loss=0.06655, over 4284767.38 frames. ], batch size: 211, lr: 2.87e-03, grad_scale: 32.0 2023-06-27 13:02:23,379 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.384e+02 6.863e+02 9.869e+02 1.519e+03 3.382e+03, threshold=1.974e+03, percent-clipped=13.0 2023-06-27 13:02:24,046 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1812912.0, ans=0.125 2023-06-27 13:02:43,855 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.30 vs. 
limit=15.0 2023-06-27 13:02:53,959 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1812972.0, ans=0.1 2023-06-27 13:02:55,601 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1813032.0, ans=0.125 2023-06-27 13:03:03,931 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.14 vs. limit=15.0 2023-06-27 13:03:50,225 INFO [train.py:996] (0/4) Epoch 10, batch 27750, loss[loss=0.2037, simple_loss=0.2876, pruned_loss=0.05992, over 21393.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2863, pruned_loss=0.06604, over 4282199.24 frames. ], batch size: 548, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:04:56,868 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1813392.0, ans=0.125 2023-06-27 13:05:26,698 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.58 vs. limit=15.0 2023-06-27 13:05:28,655 INFO [train.py:996] (0/4) Epoch 10, batch 27800, loss[loss=0.2344, simple_loss=0.298, pruned_loss=0.08536, over 21832.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2872, pruned_loss=0.06685, over 4287802.62 frames. ], batch size: 441, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:05:31,676 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.36 vs. limit=15.0 2023-06-27 13:05:43,188 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.682e+02 6.752e+02 9.329e+02 1.344e+03 2.939e+03, threshold=1.866e+03, percent-clipped=10.0 2023-06-27 13:06:04,342 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1813572.0, ans=0.125 2023-06-27 13:06:04,957 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=1813572.0, ans=10.0 2023-06-27 13:07:09,261 INFO [train.py:996] (0/4) Epoch 10, batch 27850, loss[loss=0.2031, simple_loss=0.2973, pruned_loss=0.05447, over 21752.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2867, pruned_loss=0.06758, over 4295230.52 frames. ], batch size: 247, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:07:17,065 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.52 vs. limit=12.0 2023-06-27 13:07:38,245 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=1813872.0, ans=15.0 2023-06-27 13:07:47,471 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1813872.0, ans=0.2 2023-06-27 13:08:20,997 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1813992.0, ans=0.125 2023-06-27 13:08:47,654 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1814052.0, ans=0.1 2023-06-27 13:09:01,132 INFO [train.py:996] (0/4) Epoch 10, batch 27900, loss[loss=0.2135, simple_loss=0.3135, pruned_loss=0.05678, over 21784.00 frames. 
], tot_loss[loss=0.218, simple_loss=0.2969, pruned_loss=0.06956, over 4294997.07 frames. ], batch size: 332, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:09:15,847 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.509e+02 6.352e+02 8.865e+02 1.400e+03 2.806e+03, threshold=1.773e+03, percent-clipped=7.0 2023-06-27 13:09:21,465 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=1814172.0, ans=0.025 2023-06-27 13:10:48,714 INFO [train.py:996] (0/4) Epoch 10, batch 27950, loss[loss=0.2006, simple_loss=0.3171, pruned_loss=0.042, over 20804.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2958, pruned_loss=0.06623, over 4292897.38 frames. ], batch size: 607, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:11:49,564 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1814592.0, ans=0.0 2023-06-27 13:12:11,365 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1814652.0, ans=0.2 2023-06-27 13:12:28,087 INFO [train.py:996] (0/4) Epoch 10, batch 28000, loss[loss=0.2094, simple_loss=0.2912, pruned_loss=0.06379, over 21894.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2931, pruned_loss=0.06342, over 4292408.32 frames. ], batch size: 118, lr: 2.87e-03, grad_scale: 32.0 2023-06-27 13:12:30,376 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1814712.0, ans=0.125 2023-06-27 13:12:42,691 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.320e+02 5.982e+02 8.841e+02 1.274e+03 3.365e+03, threshold=1.768e+03, percent-clipped=7.0 2023-06-27 13:12:56,540 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1814772.0, ans=0.1 2023-06-27 13:13:13,293 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1814832.0, ans=0.2 2023-06-27 13:13:45,946 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1814892.0, ans=0.04949747468305833 2023-06-27 13:14:05,481 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1814952.0, ans=0.125 2023-06-27 13:14:14,333 INFO [train.py:996] (0/4) Epoch 10, batch 28050, loss[loss=0.1307, simple_loss=0.1843, pruned_loss=0.03856, over 16594.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2897, pruned_loss=0.06483, over 4289955.72 frames. ], batch size: 62, lr: 2.87e-03, grad_scale: 32.0 2023-06-27 13:14:24,065 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.73 vs. limit=22.5 2023-06-27 13:15:00,551 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1815132.0, ans=0.1 2023-06-27 13:15:23,171 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1815192.0, ans=0.125 2023-06-27 13:15:54,363 INFO [train.py:996] (0/4) Epoch 10, batch 28100, loss[loss=0.2029, simple_loss=0.274, pruned_loss=0.06591, over 22002.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2881, pruned_loss=0.0652, over 4281133.81 frames. 
], batch size: 103, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:16:06,170 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.041e+02 5.968e+02 9.165e+02 1.416e+03 2.614e+03, threshold=1.833e+03, percent-clipped=9.0 2023-06-27 13:17:12,938 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.21 vs. limit=6.0 2023-06-27 13:17:34,194 INFO [train.py:996] (0/4) Epoch 10, batch 28150, loss[loss=0.2268, simple_loss=0.3261, pruned_loss=0.06376, over 19762.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.2817, pruned_loss=0.06562, over 4267554.84 frames. ], batch size: 702, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:17:41,230 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1815612.0, ans=0.0 2023-06-27 13:18:30,230 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.66 vs. limit=15.0 2023-06-27 13:18:57,936 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.19 vs. limit=15.0 2023-06-27 13:19:14,731 INFO [train.py:996] (0/4) Epoch 10, batch 28200, loss[loss=0.2106, simple_loss=0.2851, pruned_loss=0.068, over 21732.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.2802, pruned_loss=0.06641, over 4276823.46 frames. ], batch size: 247, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:19:26,340 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.905e+02 6.047e+02 9.821e+02 1.464e+03 4.986e+03, threshold=1.964e+03, percent-clipped=9.0 2023-06-27 13:19:31,967 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1815972.0, ans=0.125 2023-06-27 13:19:38,576 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1815972.0, ans=0.0 2023-06-27 13:20:34,097 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1816092.0, ans=0.2 2023-06-27 13:20:54,969 INFO [train.py:996] (0/4) Epoch 10, batch 28250, loss[loss=0.2158, simple_loss=0.2757, pruned_loss=0.07798, over 21539.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2829, pruned_loss=0.06857, over 4280190.14 frames. ], batch size: 391, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:21:07,193 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1816212.0, ans=0.125 2023-06-27 13:21:12,205 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1816272.0, ans=0.125 2023-06-27 13:21:47,129 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1816332.0, ans=0.0 2023-06-27 13:22:36,307 INFO [train.py:996] (0/4) Epoch 10, batch 28300, loss[loss=0.1712, simple_loss=0.2873, pruned_loss=0.02755, over 20754.00 frames. ], tot_loss[loss=0.2063, simple_loss=0.2813, pruned_loss=0.06567, over 4277264.58 frames. 
], batch size: 608, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:22:41,964 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1816512.0, ans=0.125 2023-06-27 13:22:47,918 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.032e+02 5.786e+02 9.744e+02 1.588e+03 3.149e+03, threshold=1.949e+03, percent-clipped=13.0 2023-06-27 13:23:03,268 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.01 vs. limit=22.5 2023-06-27 13:23:53,590 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1816692.0, ans=0.125 2023-06-27 13:24:00,270 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1816752.0, ans=0.125 2023-06-27 13:24:13,023 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1816752.0, ans=0.0 2023-06-27 13:24:15,588 INFO [train.py:996] (0/4) Epoch 10, batch 28350, loss[loss=0.178, simple_loss=0.2446, pruned_loss=0.05566, over 21841.00 frames. ], tot_loss[loss=0.1998, simple_loss=0.278, pruned_loss=0.06075, over 4268567.75 frames. ], batch size: 98, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:24:52,624 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1816872.0, ans=0.0 2023-06-27 13:25:22,038 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1816992.0, ans=0.0 2023-06-27 13:25:28,734 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1816992.0, ans=0.1 2023-06-27 13:25:32,681 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.68 vs. limit=15.0 2023-06-27 13:25:55,976 INFO [train.py:996] (0/4) Epoch 10, batch 28400, loss[loss=0.1973, simple_loss=0.2634, pruned_loss=0.0656, over 21739.00 frames. ], tot_loss[loss=0.1979, simple_loss=0.2741, pruned_loss=0.06088, over 4263853.63 frames. ], batch size: 371, lr: 2.87e-03, grad_scale: 32.0 2023-06-27 13:26:18,381 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.150e+02 6.326e+02 1.038e+03 1.651e+03 3.367e+03, threshold=2.075e+03, percent-clipped=16.0 2023-06-27 13:26:33,496 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1817172.0, ans=0.125 2023-06-27 13:27:15,223 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.38 vs. limit=12.0 2023-06-27 13:27:19,481 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1817352.0, ans=0.1 2023-06-27 13:27:37,251 INFO [train.py:996] (0/4) Epoch 10, batch 28450, loss[loss=0.1826, simple_loss=0.2509, pruned_loss=0.05718, over 20083.00 frames. ], tot_loss[loss=0.2042, simple_loss=0.2799, pruned_loss=0.06427, over 4252708.78 frames. 
], batch size: 703, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:27:39,456 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 13:27:56,868 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1817412.0, ans=0.2 2023-06-27 13:28:09,053 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1817472.0, ans=0.125 2023-06-27 13:28:24,886 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1817472.0, ans=0.125 2023-06-27 13:28:36,711 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1817532.0, ans=0.125 2023-06-27 13:29:09,287 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1817652.0, ans=0.125 2023-06-27 13:29:12,782 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1817652.0, ans=0.0 2023-06-27 13:29:27,867 INFO [train.py:996] (0/4) Epoch 10, batch 28500, loss[loss=0.2253, simple_loss=0.3059, pruned_loss=0.07234, over 21773.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2824, pruned_loss=0.0663, over 4264461.17 frames. ], batch size: 332, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:29:30,266 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1817712.0, ans=0.0 2023-06-27 13:29:43,079 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1817712.0, ans=0.125 2023-06-27 13:29:50,451 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.832e+02 6.822e+02 1.044e+03 1.325e+03 2.451e+03, threshold=2.088e+03, percent-clipped=2.0 2023-06-27 13:30:00,308 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.55 vs. limit=6.0 2023-06-27 13:30:07,826 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1817772.0, ans=0.2 2023-06-27 13:30:22,916 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1817832.0, ans=0.125 2023-06-27 13:31:14,231 INFO [train.py:996] (0/4) Epoch 10, batch 28550, loss[loss=0.2519, simple_loss=0.3503, pruned_loss=0.07674, over 21648.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2906, pruned_loss=0.06901, over 4270964.42 frames. ], batch size: 263, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:32:53,753 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1818312.0, ans=0.0 2023-06-27 13:32:59,169 INFO [train.py:996] (0/4) Epoch 10, batch 28600, loss[loss=0.2264, simple_loss=0.3142, pruned_loss=0.06928, over 21346.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.298, pruned_loss=0.07144, over 4274550.00 frames. 
], batch size: 131, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:33:12,232 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.169e+02 6.322e+02 9.283e+02 1.275e+03 2.692e+03, threshold=1.857e+03, percent-clipped=3.0 2023-06-27 13:33:19,561 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1818372.0, ans=0.0 2023-06-27 13:33:42,603 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1818432.0, ans=0.1 2023-06-27 13:34:40,150 INFO [train.py:996] (0/4) Epoch 10, batch 28650, loss[loss=0.1835, simple_loss=0.2365, pruned_loss=0.06527, over 21267.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2924, pruned_loss=0.07062, over 4272373.84 frames. ], batch size: 549, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:34:48,888 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1818612.0, ans=0.125 2023-06-27 13:34:58,031 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.49 vs. limit=10.0 2023-06-27 13:35:07,246 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1818672.0, ans=0.2 2023-06-27 13:35:54,470 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1818852.0, ans=0.125 2023-06-27 13:36:12,927 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.61 vs. limit=15.0 2023-06-27 13:36:16,628 INFO [train.py:996] (0/4) Epoch 10, batch 28700, loss[loss=0.2303, simple_loss=0.3037, pruned_loss=0.07845, over 21299.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2916, pruned_loss=0.07122, over 4277467.17 frames. ], batch size: 143, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:36:19,396 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.16 vs. limit=12.0 2023-06-27 13:36:19,680 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.29 vs. limit=6.0 2023-06-27 13:36:29,764 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.086e+02 6.900e+02 1.037e+03 1.524e+03 3.185e+03, threshold=2.075e+03, percent-clipped=14.0 2023-06-27 13:36:58,208 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1819032.0, ans=0.0 2023-06-27 13:37:09,545 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1819092.0, ans=0.125 2023-06-27 13:37:39,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1819152.0, ans=0.125 2023-06-27 13:37:44,985 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1819152.0, ans=0.1 2023-06-27 13:37:57,656 INFO [train.py:996] (0/4) Epoch 10, batch 28750, loss[loss=0.1926, simple_loss=0.2821, pruned_loss=0.05155, over 21683.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2903, pruned_loss=0.0718, over 4280897.73 frames. 
], batch size: 263, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:38:06,570 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1819212.0, ans=0.125 2023-06-27 13:39:33,333 INFO [train.py:996] (0/4) Epoch 10, batch 28800, loss[loss=0.2218, simple_loss=0.3035, pruned_loss=0.06999, over 16993.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2948, pruned_loss=0.072, over 4279035.00 frames. ], batch size: 60, lr: 2.87e-03, grad_scale: 32.0 2023-06-27 13:39:47,037 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.078e+02 7.759e+02 9.840e+02 1.249e+03 3.010e+03, threshold=1.968e+03, percent-clipped=7.0 2023-06-27 13:40:23,717 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.24 vs. limit=15.0 2023-06-27 13:41:09,922 INFO [train.py:996] (0/4) Epoch 10, batch 28850, loss[loss=0.224, simple_loss=0.2959, pruned_loss=0.076, over 21885.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.2964, pruned_loss=0.07334, over 4286734.77 frames. ], batch size: 371, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:41:41,575 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1819872.0, ans=0.2 2023-06-27 13:42:21,022 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1819992.0, ans=0.125 2023-06-27 13:42:42,871 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1820052.0, ans=0.0 2023-06-27 13:42:50,435 INFO [train.py:996] (0/4) Epoch 10, batch 28900, loss[loss=0.2849, simple_loss=0.4053, pruned_loss=0.08227, over 19745.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.299, pruned_loss=0.07489, over 4282803.38 frames. ], batch size: 702, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:43:05,415 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.892e+02 6.958e+02 1.036e+03 1.416e+03 3.093e+03, threshold=2.073e+03, percent-clipped=9.0 2023-06-27 13:43:48,063 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1820232.0, ans=0.125 2023-06-27 13:44:13,550 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1820292.0, ans=0.125 2023-06-27 13:44:18,790 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1820352.0, ans=0.0 2023-06-27 13:44:33,633 INFO [train.py:996] (0/4) Epoch 10, batch 28950, loss[loss=0.2559, simple_loss=0.3424, pruned_loss=0.0847, over 21555.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.2997, pruned_loss=0.07402, over 4276567.74 frames. ], batch size: 471, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:45:23,533 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1820472.0, ans=0.0 2023-06-27 13:45:35,478 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.67 vs. limit=15.0 2023-06-27 13:46:14,084 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1820712.0, ans=0.125 2023-06-27 13:46:15,101 INFO [train.py:996] (0/4) Epoch 10, batch 29000, loss[loss=0.2277, simple_loss=0.3103, pruned_loss=0.0725, over 21776.00 frames. 
], tot_loss[loss=0.2253, simple_loss=0.3037, pruned_loss=0.07344, over 4277543.78 frames. ], batch size: 332, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:46:43,661 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.952e+02 6.978e+02 9.216e+02 1.338e+03 4.286e+03, threshold=1.843e+03, percent-clipped=9.0 2023-06-27 13:46:46,327 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.02 vs. limit=10.0 2023-06-27 13:46:59,610 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.37 vs. limit=15.0 2023-06-27 13:47:19,057 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1820892.0, ans=0.125 2023-06-27 13:47:35,422 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1820952.0, ans=0.5 2023-06-27 13:47:46,679 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1820952.0, ans=0.125 2023-06-27 13:48:04,744 INFO [train.py:996] (0/4) Epoch 10, batch 29050, loss[loss=0.2024, simple_loss=0.276, pruned_loss=0.06434, over 21286.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3022, pruned_loss=0.07418, over 4284012.26 frames. ], batch size: 176, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:49:08,647 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1821192.0, ans=0.2 2023-06-27 13:49:39,553 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1821312.0, ans=0.035 2023-06-27 13:49:40,668 INFO [train.py:996] (0/4) Epoch 10, batch 29100, loss[loss=0.1627, simple_loss=0.2273, pruned_loss=0.0491, over 21488.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2935, pruned_loss=0.07202, over 4283264.40 frames. ], batch size: 212, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:49:55,603 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.400e+02 6.043e+02 9.332e+02 1.585e+03 3.722e+03, threshold=1.866e+03, percent-clipped=13.0 2023-06-27 13:50:11,796 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.52 vs. limit=15.0 2023-06-27 13:50:39,378 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.83 vs. limit=10.0 2023-06-27 13:51:16,638 INFO [train.py:996] (0/4) Epoch 10, batch 29150, loss[loss=0.2327, simple_loss=0.3269, pruned_loss=0.06928, over 21796.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2919, pruned_loss=0.07049, over 4281918.44 frames. 
], batch size: 282, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:51:27,524 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1821612.0, ans=0.125 2023-06-27 13:52:10,297 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1821792.0, ans=0.125 2023-06-27 13:52:40,847 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1821852.0, ans=0.125 2023-06-27 13:52:48,468 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1821852.0, ans=0.125 2023-06-27 13:52:50,138 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1821852.0, ans=0.025 2023-06-27 13:52:57,671 INFO [train.py:996] (0/4) Epoch 10, batch 29200, loss[loss=0.1902, simple_loss=0.2617, pruned_loss=0.05934, over 21328.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2884, pruned_loss=0.07015, over 4275828.82 frames. ], batch size: 131, lr: 2.87e-03, grad_scale: 32.0 2023-06-27 13:53:08,244 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1821912.0, ans=0.1 2023-06-27 13:53:13,864 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.81 vs. limit=15.0 2023-06-27 13:53:14,036 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.182e+02 6.160e+02 1.002e+03 1.749e+03 3.498e+03, threshold=2.004e+03, percent-clipped=20.0 2023-06-27 13:53:27,705 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1821972.0, ans=0.1 2023-06-27 13:53:32,434 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1822032.0, ans=0.125 2023-06-27 13:54:06,722 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1822152.0, ans=0.125 2023-06-27 13:54:15,394 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1822152.0, ans=0.0 2023-06-27 13:54:29,615 INFO [train.py:996] (0/4) Epoch 10, batch 29250, loss[loss=0.2276, simple_loss=0.3187, pruned_loss=0.0682, over 21721.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2868, pruned_loss=0.06813, over 4270424.52 frames. ], batch size: 352, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:55:11,865 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1822332.0, ans=0.035 2023-06-27 13:55:30,911 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1822392.0, ans=0.125 2023-06-27 13:56:02,214 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1822452.0, ans=0.125 2023-06-27 13:56:06,668 INFO [train.py:996] (0/4) Epoch 10, batch 29300, loss[loss=0.2487, simple_loss=0.3069, pruned_loss=0.09527, over 21309.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2901, pruned_loss=0.06769, over 4278678.04 frames. 
], batch size: 471, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:56:22,985 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.212e+02 5.568e+02 7.846e+02 1.257e+03 2.359e+03, threshold=1.569e+03, percent-clipped=3.0 2023-06-27 13:56:23,491 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1822572.0, ans=0.0 2023-06-27 13:56:46,976 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1822632.0, ans=0.0 2023-06-27 13:56:48,312 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1822632.0, ans=0.2 2023-06-27 13:57:26,621 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1822692.0, ans=0.125 2023-06-27 13:57:48,083 INFO [train.py:996] (0/4) Epoch 10, batch 29350, loss[loss=0.2013, simple_loss=0.2841, pruned_loss=0.05922, over 21682.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2863, pruned_loss=0.06704, over 4269040.22 frames. ], batch size: 282, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:59:04,842 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1822992.0, ans=0.07 2023-06-27 13:59:31,049 INFO [train.py:996] (0/4) Epoch 10, batch 29400, loss[loss=0.2123, simple_loss=0.3065, pruned_loss=0.05908, over 21195.00 frames. ], tot_loss[loss=0.2077, simple_loss=0.2852, pruned_loss=0.06506, over 4256745.44 frames. ], batch size: 548, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:59:47,674 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.061e+02 6.893e+02 1.012e+03 1.543e+03 3.903e+03, threshold=2.024e+03, percent-clipped=23.0 2023-06-27 14:00:19,248 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1823232.0, ans=0.0 2023-06-27 14:00:38,115 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1823292.0, ans=0.0 2023-06-27 14:00:56,893 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1823352.0, ans=0.125 2023-06-27 14:01:12,896 INFO [train.py:996] (0/4) Epoch 10, batch 29450, loss[loss=0.2515, simple_loss=0.3305, pruned_loss=0.08625, over 21608.00 frames. ], tot_loss[loss=0.2061, simple_loss=0.2841, pruned_loss=0.06407, over 4266849.28 frames. ], batch size: 389, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 14:02:20,266 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1823592.0, ans=0.125 2023-06-27 14:02:25,084 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1823592.0, ans=0.0 2023-06-27 14:02:35,032 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1823652.0, ans=0.025 2023-06-27 14:02:35,777 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.72 vs. 
limit=5.0 2023-06-27 14:02:38,166 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1823652.0, ans=0.125 2023-06-27 14:02:45,813 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1823652.0, ans=0.0 2023-06-27 14:02:51,895 INFO [train.py:996] (0/4) Epoch 10, batch 29500, loss[loss=0.1919, simple_loss=0.2551, pruned_loss=0.06438, over 20225.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.2868, pruned_loss=0.06613, over 4268343.45 frames. ], batch size: 703, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 14:02:53,236 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.85 vs. limit=15.0 2023-06-27 14:03:07,701 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.631e+02 6.865e+02 1.061e+03 1.645e+03 3.419e+03, threshold=2.123e+03, percent-clipped=12.0 2023-06-27 14:03:49,550 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1823832.0, ans=0.125 2023-06-27 14:03:53,099 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.50 vs. limit=15.0 2023-06-27 14:04:25,647 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1823952.0, ans=0.125 2023-06-27 14:04:25,676 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1823952.0, ans=0.125 2023-06-27 14:04:27,018 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-304000.pt 2023-06-27 14:04:33,492 INFO [train.py:996] (0/4) Epoch 10, batch 29550, loss[loss=0.2258, simple_loss=0.2919, pruned_loss=0.07986, over 21818.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2861, pruned_loss=0.06787, over 4279449.81 frames. ], batch size: 441, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 14:04:39,891 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.75 vs. limit=15.0 2023-06-27 14:06:03,165 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1824252.0, ans=0.1 2023-06-27 14:06:10,189 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1824312.0, ans=0.125 2023-06-27 14:06:11,157 INFO [train.py:996] (0/4) Epoch 10, batch 29600, loss[loss=0.2416, simple_loss=0.3168, pruned_loss=0.08323, over 21353.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2919, pruned_loss=0.06953, over 4283749.89 frames. ], batch size: 176, lr: 2.87e-03, grad_scale: 32.0 2023-06-27 14:06:29,601 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.301e+02 5.949e+02 7.426e+02 9.960e+02 2.480e+03, threshold=1.485e+03, percent-clipped=1.0 2023-06-27 14:07:17,584 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1824492.0, ans=0.125 2023-06-27 14:07:43,060 INFO [train.py:996] (0/4) Epoch 10, batch 29650, loss[loss=0.2364, simple_loss=0.3071, pruned_loss=0.08289, over 21711.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2887, pruned_loss=0.06639, over 4288848.73 frames. 
], batch size: 441, lr: 2.86e-03, grad_scale: 16.0 2023-06-27 14:07:44,142 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.19 vs. limit=15.0 2023-06-27 14:07:48,868 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.47 vs. limit=15.0 2023-06-27 14:08:37,607 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1824732.0, ans=0.1 2023-06-27 14:08:53,313 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.61 vs. limit=15.0 2023-06-27 14:08:54,313 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 14:09:05,979 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1824852.0, ans=0.0 2023-06-27 14:09:12,710 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1824852.0, ans=0.125 2023-06-27 14:09:20,374 INFO [train.py:996] (0/4) Epoch 10, batch 29700, loss[loss=0.2341, simple_loss=0.3403, pruned_loss=0.06396, over 21800.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2897, pruned_loss=0.06621, over 4295320.00 frames. ], batch size: 282, lr: 2.86e-03, grad_scale: 16.0 2023-06-27 14:09:42,884 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.057e+02 7.297e+02 1.065e+03 1.869e+03 3.621e+03, threshold=2.131e+03, percent-clipped=32.0 2023-06-27 14:10:01,132 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff2.min_abs, batch_count=1824972.0, ans=0.1 2023-06-27 14:10:02,590 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1824972.0, ans=0.0 2023-06-27 14:10:07,682 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1825032.0, ans=0.07 2023-06-27 14:10:25,809 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1825092.0, ans=0.04949747468305833 2023-06-27 14:10:30,569 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1825092.0, ans=0.125 2023-06-27 14:10:52,050 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1825152.0, ans=0.125 2023-06-27 14:10:56,005 INFO [train.py:996] (0/4) Epoch 10, batch 29750, loss[loss=0.1932, simple_loss=0.2732, pruned_loss=0.05657, over 21331.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2952, pruned_loss=0.06653, over 4292973.27 frames. 
], batch size: 176, lr: 2.86e-03, grad_scale: 16.0 2023-06-27 14:11:11,073 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1825272.0, ans=0.05 2023-06-27 14:11:12,885 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1825272.0, ans=0.125 2023-06-27 14:11:54,783 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1825392.0, ans=0.07 2023-06-27 14:11:58,809 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.11 vs. limit=15.0 2023-06-27 14:12:27,396 INFO [train.py:996] (0/4) Epoch 10, batch 29800, loss[loss=0.2156, simple_loss=0.3003, pruned_loss=0.06545, over 21532.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2979, pruned_loss=0.06729, over 4289291.10 frames. ], batch size: 211, lr: 2.86e-03, grad_scale: 16.0 2023-06-27 14:12:28,121 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1825512.0, ans=0.125 2023-06-27 14:12:51,510 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.238e+02 6.246e+02 9.031e+02 1.363e+03 2.753e+03, threshold=1.806e+03, percent-clipped=5.0 2023-06-27 14:13:16,237 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1825632.0, ans=0.0 2023-06-27 14:13:44,839 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1825752.0, ans=0.0 2023-06-27 14:13:46,320 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=1825752.0, ans=0.5 2023-06-27 14:13:52,274 INFO [train.py:996] (0/4) Epoch 10, batch 29850, loss[loss=0.1819, simple_loss=0.2594, pruned_loss=0.05221, over 21772.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2943, pruned_loss=0.06539, over 4282229.46 frames. ], batch size: 247, lr: 2.86e-03, grad_scale: 8.0 2023-06-27 14:14:03,055 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.18 vs. limit=6.0 2023-06-27 14:14:05,602 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1825812.0, ans=0.05 2023-06-27 14:15:27,917 INFO [train.py:996] (0/4) Epoch 10, batch 29900, loss[loss=0.1767, simple_loss=0.2191, pruned_loss=0.06711, over 20198.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2923, pruned_loss=0.06676, over 4284722.40 frames. 
], batch size: 704, lr: 2.86e-03, grad_scale: 8.0 2023-06-27 14:15:33,555 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1826112.0, ans=0.2 2023-06-27 14:16:07,295 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.080e+02 5.707e+02 7.601e+02 1.173e+03 3.198e+03, threshold=1.520e+03, percent-clipped=6.0 2023-06-27 14:16:26,702 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1826232.0, ans=0.0 2023-06-27 14:16:43,454 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1826292.0, ans=0.125 2023-06-27 14:16:51,536 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1826292.0, ans=0.2 2023-06-27 14:17:00,015 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1826352.0, ans=0.1 2023-06-27 14:17:10,655 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.69 vs. limit=6.0 2023-06-27 14:17:11,027 INFO [train.py:996] (0/4) Epoch 10, batch 29950, loss[loss=0.1932, simple_loss=0.2436, pruned_loss=0.07136, over 20185.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2956, pruned_loss=0.07084, over 4278087.19 frames. ], batch size: 707, lr: 2.86e-03, grad_scale: 8.0 2023-06-27 14:17:18,983 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.18 vs. limit=15.0 2023-06-27 14:17:39,904 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1826472.0, ans=0.1 2023-06-27 14:17:56,037 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.07 vs. limit=15.0 2023-06-27 14:17:57,354 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1826532.0, ans=0.1 2023-06-27 14:18:00,720 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1826532.0, ans=0.125 2023-06-27 14:18:20,137 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1826592.0, ans=0.125 2023-06-27 14:19:04,987 INFO [train.py:996] (0/4) Epoch 10, batch 30000, loss[loss=0.1937, simple_loss=0.2866, pruned_loss=0.05046, over 21639.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2979, pruned_loss=0.07055, over 4271363.16 frames. ], batch size: 230, lr: 2.86e-03, grad_scale: 16.0 2023-06-27 14:19:04,989 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-27 14:19:22,073 INFO [train.py:1028] (0/4) Epoch 10, validation: loss=0.2475, simple_loss=0.3412, pruned_loss=0.07692, over 1796401.00 frames. 2023-06-27 14:19:22,074 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23714MB 2023-06-27 14:19:43,675 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.141e+02 6.862e+02 9.553e+02 1.677e+03 3.481e+03, threshold=1.911e+03, percent-clipped=29.0 2023-06-27 14:19:59,130 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.67 vs. 
limit=22.5 2023-06-27 14:20:09,710 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1826832.0, ans=0.2 2023-06-27 14:20:13,558 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1826832.0, ans=0.125 2023-06-27 14:21:05,356 INFO [train.py:996] (0/4) Epoch 10, batch 30050, loss[loss=0.2031, simple_loss=0.3149, pruned_loss=0.04559, over 20801.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2994, pruned_loss=0.06717, over 4273715.28 frames. ], batch size: 607, lr: 2.86e-03, grad_scale: 16.0 2023-06-27 14:21:53,797 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=1827132.0, ans=15.0 2023-06-27 14:21:59,815 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 14:22:18,007 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1827192.0, ans=0.0 2023-06-27 14:22:39,168 INFO [train.py:996] (0/4) Epoch 10, batch 30100, loss[loss=0.2071, simple_loss=0.2736, pruned_loss=0.07032, over 21209.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2985, pruned_loss=0.06716, over 4269020.69 frames. ], batch size: 176, lr: 2.86e-03, grad_scale: 16.0 2023-06-27 14:22:41,277 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1827312.0, ans=0.0 2023-06-27 14:22:42,927 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1827312.0, ans=0.125 2023-06-27 14:22:58,224 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.865e+02 7.541e+02 1.187e+03 1.645e+03 3.691e+03, threshold=2.374e+03, percent-clipped=12.0 2023-06-27 14:23:22,585 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.12 vs. limit=22.5 2023-06-27 14:23:31,830 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1827432.0, ans=0.0 2023-06-27 14:23:33,621 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1827432.0, ans=0.125 2023-06-27 14:23:42,185 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1827492.0, ans=0.0 2023-06-27 14:23:55,757 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1827492.0, ans=0.0 2023-06-27 14:23:55,823 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1827492.0, ans=0.07 2023-06-27 14:24:17,470 INFO [train.py:996] (0/4) Epoch 10, batch 30150, loss[loss=0.258, simple_loss=0.3277, pruned_loss=0.09416, over 21281.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2952, pruned_loss=0.06895, over 4272018.92 frames. 
], batch size: 143, lr: 2.86e-03, grad_scale: 16.0 2023-06-27 14:24:28,303 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1827612.0, ans=0.125 2023-06-27 14:25:20,681 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1827732.0, ans=0.125 2023-06-27 14:25:27,813 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1827792.0, ans=0.125 2023-06-27 14:25:38,797 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.62 vs. limit=10.0 2023-06-27 14:26:02,897 INFO [train.py:996] (0/4) Epoch 10, batch 30200, loss[loss=0.2124, simple_loss=0.2919, pruned_loss=0.06647, over 21412.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2977, pruned_loss=0.06832, over 4269055.04 frames. ], batch size: 211, lr: 2.86e-03, grad_scale: 16.0 2023-06-27 14:26:41,783 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 14:26:42,657 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.354e+02 6.809e+02 8.710e+02 1.204e+03 2.614e+03, threshold=1.742e+03, percent-clipped=2.0 2023-06-27 14:27:19,919 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1828092.0, ans=0.125 2023-06-27 14:27:26,517 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1828092.0, ans=0.125 2023-06-27 14:27:39,929 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1828152.0, ans=0.0 2023-06-27 14:28:02,215 INFO [train.py:996] (0/4) Epoch 10, batch 30250, loss[loss=0.2043, simple_loss=0.2755, pruned_loss=0.06657, over 21937.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.3033, pruned_loss=0.06908, over 4273307.19 frames. ], batch size: 98, lr: 2.86e-03, grad_scale: 16.0 2023-06-27 14:28:09,401 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1828212.0, ans=0.0 2023-06-27 14:28:15,584 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1828212.0, ans=0.035 2023-06-27 14:28:15,636 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1828212.0, ans=0.125 2023-06-27 14:29:17,096 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1828452.0, ans=0.1 2023-06-27 14:29:38,360 INFO [train.py:996] (0/4) Epoch 10, batch 30300, loss[loss=0.1763, simple_loss=0.2477, pruned_loss=0.05246, over 21647.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.3003, pruned_loss=0.0694, over 4270682.66 frames. ], batch size: 282, lr: 2.86e-03, grad_scale: 16.0 2023-06-27 14:30:01,847 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.29 vs. 
limit=6.0 2023-06-27 14:30:03,863 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.241e+02 6.596e+02 9.409e+02 1.315e+03 2.834e+03, threshold=1.882e+03, percent-clipped=10.0 2023-06-27 14:30:38,994 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1828692.0, ans=0.0 2023-06-27 14:31:21,109 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1828752.0, ans=0.125 2023-06-27 14:31:27,459 INFO [train.py:996] (0/4) Epoch 10, batch 30350, loss[loss=0.2155, simple_loss=0.2947, pruned_loss=0.06815, over 21646.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2998, pruned_loss=0.07054, over 4263339.94 frames. ], batch size: 247, lr: 2.86e-03, grad_scale: 16.0 2023-06-27 14:31:48,278 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1828872.0, ans=0.125 2023-06-27 14:32:24,431 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1828992.0, ans=0.125 2023-06-27 14:32:41,793 INFO [train.py:996] (0/4) Epoch 10, batch 30400, loss[loss=0.2084, simple_loss=0.2512, pruned_loss=0.08281, over 20220.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2958, pruned_loss=0.06954, over 4257905.63 frames. ], batch size: 703, lr: 2.86e-03, grad_scale: 32.0 2023-06-27 14:32:44,012 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1829112.0, ans=0.2 2023-06-27 14:32:53,168 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 14:33:09,682 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.426e+02 7.954e+02 1.288e+03 1.926e+03 4.132e+03, threshold=2.577e+03, percent-clipped=26.0 2023-06-27 14:33:36,316 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1829232.0, ans=0.125 2023-06-27 14:34:07,225 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1829412.0, ans=0.0 2023-06-27 14:34:08,242 INFO [train.py:996] (0/4) Epoch 10, batch 30450, loss[loss=0.266, simple_loss=0.3864, pruned_loss=0.0728, over 19850.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2964, pruned_loss=0.06896, over 4199467.44 frames. ], batch size: 702, lr: 2.86e-03, grad_scale: 16.0 2023-06-27 14:34:37,669 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 14:35:00,088 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1829592.0, ans=0.125 2023-06-27 14:35:02,352 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1829592.0, ans=0.125 2023-06-27 14:35:05,313 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.03 vs. 
limit=15.0 2023-06-27 14:35:12,409 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1829652.0, ans=0.0 2023-06-27 14:35:17,891 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/epoch-10.pt 2023-06-27 14:37:28,490 INFO [train.py:996] (0/4) Epoch 11, batch 0, loss[loss=0.2028, simple_loss=0.2666, pruned_loss=0.06951, over 21498.00 frames. ], tot_loss[loss=0.2028, simple_loss=0.2666, pruned_loss=0.06951, over 21498.00 frames. ], batch size: 195, lr: 2.72e-03, grad_scale: 32.0 2023-06-27 14:37:28,491 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-27 14:37:44,726 INFO [train.py:1028] (0/4) Epoch 11, validation: loss=0.2445, simple_loss=0.3464, pruned_loss=0.07127, over 1796401.00 frames. 2023-06-27 14:37:44,727 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23714MB 2023-06-27 14:38:23,079 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.704e+02 1.606e+03 2.605e+03 4.493e+03 1.142e+04, threshold=5.209e+03, percent-clipped=50.0 2023-06-27 14:38:44,771 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1829796.0, ans=0.0 2023-06-27 14:38:47,068 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.99 vs. limit=12.0 2023-06-27 14:38:49,893 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1829856.0, ans=0.125 2023-06-27 14:39:00,981 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1829856.0, ans=0.125 2023-06-27 14:39:04,036 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1829916.0, ans=0.125 2023-06-27 14:39:20,596 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1829916.0, ans=0.1 2023-06-27 14:39:26,624 INFO [train.py:996] (0/4) Epoch 11, batch 50, loss[loss=0.2652, simple_loss=0.3465, pruned_loss=0.0919, over 21485.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.3025, pruned_loss=0.06941, over 961736.41 frames. ], batch size: 471, lr: 2.72e-03, grad_scale: 16.0 2023-06-27 14:39:33,625 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1829976.0, ans=0.0 2023-06-27 14:40:15,175 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1830096.0, ans=0.0 2023-06-27 14:41:08,863 INFO [train.py:996] (0/4) Epoch 11, batch 100, loss[loss=0.1928, simple_loss=0.2675, pruned_loss=0.05912, over 21847.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3143, pruned_loss=0.07224, over 1688681.54 frames. ], batch size: 118, lr: 2.72e-03, grad_scale: 16.0 2023-06-27 14:41:46,190 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.133e+02 5.871e+02 7.705e+02 1.160e+03 1.899e+03, threshold=1.541e+03, percent-clipped=0.0 2023-06-27 14:41:50,007 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1830396.0, ans=0.125 2023-06-27 14:41:57,095 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.64 vs. 
limit=15.0 2023-06-27 14:42:28,599 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1830516.0, ans=0.1 2023-06-27 14:42:51,592 INFO [train.py:996] (0/4) Epoch 11, batch 150, loss[loss=0.1809, simple_loss=0.2396, pruned_loss=0.0611, over 16366.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3181, pruned_loss=0.07223, over 2263693.75 frames. ], batch size: 64, lr: 2.72e-03, grad_scale: 16.0 2023-06-27 14:44:02,905 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1830756.0, ans=0.0 2023-06-27 14:44:33,948 INFO [train.py:996] (0/4) Epoch 11, batch 200, loss[loss=0.1993, simple_loss=0.2738, pruned_loss=0.06241, over 21868.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3146, pruned_loss=0.07076, over 2706876.45 frames. ], batch size: 283, lr: 2.72e-03, grad_scale: 16.0 2023-06-27 14:44:43,342 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.16 vs. limit=10.0 2023-06-27 14:45:11,943 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.129e+02 7.270e+02 1.005e+03 1.466e+03 4.683e+03, threshold=2.010e+03, percent-clipped=22.0 2023-06-27 14:46:17,376 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1831176.0, ans=10.0 2023-06-27 14:46:18,451 INFO [train.py:996] (0/4) Epoch 11, batch 250, loss[loss=0.2346, simple_loss=0.3439, pruned_loss=0.06264, over 19781.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3105, pruned_loss=0.07116, over 3048692.27 frames. ], batch size: 703, lr: 2.72e-03, grad_scale: 16.0 2023-06-27 14:46:22,284 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1831176.0, ans=0.2 2023-06-27 14:46:46,852 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1831236.0, ans=0.125 2023-06-27 14:47:16,119 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.21 vs. limit=15.0 2023-06-27 14:48:01,911 INFO [train.py:996] (0/4) Epoch 11, batch 300, loss[loss=0.1809, simple_loss=0.2487, pruned_loss=0.05654, over 21092.00 frames. ], tot_loss[loss=0.223, simple_loss=0.3048, pruned_loss=0.0706, over 3325667.98 frames. ], batch size: 607, lr: 2.72e-03, grad_scale: 16.0 2023-06-27 14:48:17,623 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1831536.0, ans=0.1 2023-06-27 14:48:40,912 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.322e+02 6.333e+02 9.156e+02 1.285e+03 2.394e+03, threshold=1.831e+03, percent-clipped=6.0 2023-06-27 14:48:41,633 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1831596.0, ans=0.125 2023-06-27 14:48:43,441 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1831596.0, ans=0.1 2023-06-27 14:49:43,754 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.94 vs. 
limit=15.0 2023-06-27 14:49:47,633 INFO [train.py:996] (0/4) Epoch 11, batch 350, loss[loss=0.1964, simple_loss=0.2614, pruned_loss=0.06572, over 21479.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2973, pruned_loss=0.06873, over 3519599.30 frames. ], batch size: 195, lr: 2.72e-03, grad_scale: 16.0 2023-06-27 14:49:53,507 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1831776.0, ans=0.09899494936611666 2023-06-27 14:50:08,888 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1831836.0, ans=0.125 2023-06-27 14:50:10,715 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1831836.0, ans=0.125 2023-06-27 14:50:15,765 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1831836.0, ans=0.2 2023-06-27 14:50:33,303 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1831896.0, ans=0.125 2023-06-27 14:51:25,910 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1832016.0, ans=0.125 2023-06-27 14:51:30,045 INFO [train.py:996] (0/4) Epoch 11, batch 400, loss[loss=0.2365, simple_loss=0.3425, pruned_loss=0.06521, over 21847.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2925, pruned_loss=0.06663, over 3684911.47 frames. ], batch size: 316, lr: 2.72e-03, grad_scale: 32.0 2023-06-27 14:51:38,813 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1832076.0, ans=0.0 2023-06-27 14:51:38,895 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1832076.0, ans=0.125 2023-06-27 14:51:45,456 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1832136.0, ans=0.1 2023-06-27 14:52:09,629 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.211e+02 7.477e+02 1.167e+03 1.835e+03 4.227e+03, threshold=2.334e+03, percent-clipped=25.0 2023-06-27 14:52:10,748 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.77 vs. limit=15.0 2023-06-27 14:52:27,071 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1832196.0, ans=0.0 2023-06-27 14:52:35,244 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 14:52:41,499 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1832256.0, ans=0.09899494936611666 2023-06-27 14:53:10,813 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.10 vs. limit=10.0 2023-06-27 14:53:12,789 INFO [train.py:996] (0/4) Epoch 11, batch 450, loss[loss=0.1802, simple_loss=0.2467, pruned_loss=0.05685, over 21586.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2893, pruned_loss=0.06593, over 3817409.03 frames. 
], batch size: 263, lr: 2.72e-03, grad_scale: 16.0 2023-06-27 14:54:57,321 INFO [train.py:996] (0/4) Epoch 11, batch 500, loss[loss=0.2253, simple_loss=0.3021, pruned_loss=0.07431, over 21294.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.2872, pruned_loss=0.06477, over 3916689.52 frames. ], batch size: 159, lr: 2.72e-03, grad_scale: 16.0 2023-06-27 14:55:02,776 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1832676.0, ans=0.125 2023-06-27 14:55:27,733 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1832736.0, ans=0.2 2023-06-27 14:55:36,109 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1832796.0, ans=0.125 2023-06-27 14:55:37,225 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.096e+02 9.470e+02 1.676e+03 2.580e+03 4.364e+03, threshold=3.351e+03, percent-clipped=30.0 2023-06-27 14:55:50,660 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1832796.0, ans=0.07 2023-06-27 14:56:39,107 INFO [train.py:996] (0/4) Epoch 11, batch 550, loss[loss=0.246, simple_loss=0.3147, pruned_loss=0.08862, over 21746.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2919, pruned_loss=0.06522, over 3998471.93 frames. ], batch size: 441, lr: 2.72e-03, grad_scale: 16.0 2023-06-27 14:57:11,004 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_na.min_abs, batch_count=1833036.0, ans=0.02 2023-06-27 14:58:11,172 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1833216.0, ans=0.0 2023-06-27 14:58:22,118 INFO [train.py:996] (0/4) Epoch 11, batch 600, loss[loss=0.2121, simple_loss=0.2906, pruned_loss=0.06678, over 21675.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2973, pruned_loss=0.06542, over 4066134.54 frames. ], batch size: 263, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 14:58:22,855 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1833276.0, ans=0.05 2023-06-27 14:58:23,499 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.72 vs. limit=15.0 2023-06-27 14:59:00,930 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.101e+02 6.551e+02 9.996e+02 1.452e+03 3.285e+03, threshold=1.999e+03, percent-clipped=0.0 2023-06-27 14:59:11,375 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1833396.0, ans=0.025 2023-06-27 14:59:16,536 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1833396.0, ans=0.0 2023-06-27 14:59:41,062 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1833516.0, ans=0.2 2023-06-27 14:59:42,706 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1833516.0, ans=0.125 2023-06-27 15:00:03,724 INFO [train.py:996] (0/4) Epoch 11, batch 650, loss[loss=0.2032, simple_loss=0.2747, pruned_loss=0.06586, over 21685.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2991, pruned_loss=0.06628, over 4119146.90 frames. 
], batch size: 316, lr: 2.71e-03, grad_scale: 8.0 2023-06-27 15:00:40,650 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1833696.0, ans=0.125 2023-06-27 15:01:04,898 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 15:01:39,107 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1833876.0, ans=0.0 2023-06-27 15:01:39,989 INFO [train.py:996] (0/4) Epoch 11, batch 700, loss[loss=0.2131, simple_loss=0.3002, pruned_loss=0.06301, over 21841.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2991, pruned_loss=0.06698, over 4155892.57 frames. ], batch size: 124, lr: 2.71e-03, grad_scale: 8.0 2023-06-27 15:02:26,189 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.643e+02 7.317e+02 1.195e+03 1.924e+03 5.182e+03, threshold=2.390e+03, percent-clipped=22.0 2023-06-27 15:03:26,552 INFO [train.py:996] (0/4) Epoch 11, batch 750, loss[loss=0.2314, simple_loss=0.346, pruned_loss=0.05835, over 21728.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2964, pruned_loss=0.06679, over 4190056.32 frames. ], batch size: 414, lr: 2.71e-03, grad_scale: 8.0 2023-06-27 15:03:28,727 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1834176.0, ans=0.1 2023-06-27 15:03:30,526 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1834176.0, ans=0.0 2023-06-27 15:04:02,323 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1834236.0, ans=10.0 2023-06-27 15:04:06,117 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1834296.0, ans=0.125 2023-06-27 15:05:09,840 INFO [train.py:996] (0/4) Epoch 11, batch 800, loss[loss=0.1872, simple_loss=0.261, pruned_loss=0.05673, over 21367.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2948, pruned_loss=0.06715, over 4213703.90 frames. ], batch size: 211, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:05:30,667 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1834536.0, ans=0.2 2023-06-27 15:05:51,249 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.162e+02 6.690e+02 1.036e+03 1.625e+03 3.290e+03, threshold=2.071e+03, percent-clipped=5.0 2023-06-27 15:06:53,139 INFO [train.py:996] (0/4) Epoch 11, batch 850, loss[loss=0.2099, simple_loss=0.2792, pruned_loss=0.0703, over 21889.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.293, pruned_loss=0.06741, over 4230825.18 frames. ], batch size: 332, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:06:56,927 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1834776.0, ans=0.125 2023-06-27 15:08:32,837 INFO [train.py:996] (0/4) Epoch 11, batch 900, loss[loss=0.2116, simple_loss=0.2835, pruned_loss=0.0698, over 21758.00 frames. ], tot_loss[loss=0.212, simple_loss=0.2898, pruned_loss=0.06706, over 4233369.65 frames. 
], batch size: 391, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:09:18,557 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.069e+02 6.963e+02 1.051e+03 1.568e+03 3.283e+03, threshold=2.103e+03, percent-clipped=8.0 2023-06-27 15:10:10,467 INFO [train.py:996] (0/4) Epoch 11, batch 950, loss[loss=0.2021, simple_loss=0.274, pruned_loss=0.06511, over 21252.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2873, pruned_loss=0.06688, over 4248749.64 frames. ], batch size: 159, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:11:02,323 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1835496.0, ans=0.125 2023-06-27 15:11:27,087 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1835556.0, ans=0.125 2023-06-27 15:11:53,119 INFO [train.py:996] (0/4) Epoch 11, batch 1000, loss[loss=0.2296, simple_loss=0.2989, pruned_loss=0.08012, over 21865.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2871, pruned_loss=0.06692, over 4262949.07 frames. ], batch size: 414, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:12:19,126 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1835736.0, ans=0.0 2023-06-27 15:12:22,274 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 15:12:41,728 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1835796.0, ans=0.0 2023-06-27 15:12:44,391 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.633e+02 7.396e+02 1.258e+03 1.842e+03 3.420e+03, threshold=2.515e+03, percent-clipped=20.0 2023-06-27 15:13:36,704 INFO [train.py:996] (0/4) Epoch 11, batch 1050, loss[loss=0.1964, simple_loss=0.2731, pruned_loss=0.05983, over 21418.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2871, pruned_loss=0.06644, over 4266879.75 frames. ], batch size: 194, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:14:24,616 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1836096.0, ans=0.125 2023-06-27 15:14:56,718 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1836156.0, ans=0.125 2023-06-27 15:14:58,873 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.56 vs. limit=15.0 2023-06-27 15:15:10,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1836216.0, ans=0.0 2023-06-27 15:15:26,364 INFO [train.py:996] (0/4) Epoch 11, batch 1100, loss[loss=0.2065, simple_loss=0.2845, pruned_loss=0.06427, over 21442.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2889, pruned_loss=0.06724, over 4274196.72 frames. ], batch size: 548, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:15:35,370 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1836276.0, ans=0.1 2023-06-27 15:15:48,295 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.05 vs. 
limit=15.0 2023-06-27 15:16:05,630 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1836336.0, ans=0.125 2023-06-27 15:16:13,024 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.192e+02 8.562e+02 1.240e+03 1.886e+03 2.880e+03, threshold=2.480e+03, percent-clipped=5.0 2023-06-27 15:16:35,988 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1836456.0, ans=0.125 2023-06-27 15:16:41,237 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.29 vs. limit=15.0 2023-06-27 15:16:45,797 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1836456.0, ans=0.125 2023-06-27 15:16:48,210 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.84 vs. limit=15.0 2023-06-27 15:17:09,915 INFO [train.py:996] (0/4) Epoch 11, batch 1150, loss[loss=0.1718, simple_loss=0.2541, pruned_loss=0.04481, over 21362.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2889, pruned_loss=0.06651, over 4279918.63 frames. ], batch size: 131, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:18:40,635 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1836816.0, ans=0.125 2023-06-27 15:18:44,087 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1836816.0, ans=0.125 2023-06-27 15:18:53,538 INFO [train.py:996] (0/4) Epoch 11, batch 1200, loss[loss=0.2344, simple_loss=0.3218, pruned_loss=0.07349, over 21583.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2913, pruned_loss=0.06676, over 4283076.98 frames. ], batch size: 441, lr: 2.71e-03, grad_scale: 32.0 2023-06-27 15:18:54,188 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1836876.0, ans=0.125 2023-06-27 15:19:28,041 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1836936.0, ans=0.125 2023-06-27 15:19:47,119 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.628e+02 7.428e+02 1.142e+03 1.630e+03 3.056e+03, threshold=2.284e+03, percent-clipped=6.0 2023-06-27 15:20:17,910 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1837056.0, ans=0.1 2023-06-27 15:20:19,455 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1837116.0, ans=0.125 2023-06-27 15:20:37,522 INFO [train.py:996] (0/4) Epoch 11, batch 1250, loss[loss=0.216, simple_loss=0.2951, pruned_loss=0.06847, over 21758.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2943, pruned_loss=0.06751, over 4283867.04 frames. 
], batch size: 112, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:21:36,333 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1837296.0, ans=0.1 2023-06-27 15:21:49,717 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1837356.0, ans=0.1 2023-06-27 15:21:59,500 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1837356.0, ans=0.0 2023-06-27 15:22:21,881 INFO [train.py:996] (0/4) Epoch 11, batch 1300, loss[loss=0.2847, simple_loss=0.3452, pruned_loss=0.1121, over 21678.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2959, pruned_loss=0.06814, over 4282646.27 frames. ], batch size: 507, lr: 2.71e-03, grad_scale: 8.0 2023-06-27 15:22:54,694 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.26 vs. limit=12.0 2023-06-27 15:23:00,668 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 15:23:04,051 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1837596.0, ans=0.0 2023-06-27 15:23:16,824 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.358e+02 6.400e+02 8.214e+02 1.269e+03 2.290e+03, threshold=1.643e+03, percent-clipped=1.0 2023-06-27 15:23:52,430 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.82 vs. limit=12.0 2023-06-27 15:24:11,940 INFO [train.py:996] (0/4) Epoch 11, batch 1350, loss[loss=0.2438, simple_loss=0.3194, pruned_loss=0.08408, over 21595.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2967, pruned_loss=0.06848, over 4291571.07 frames. ], batch size: 415, lr: 2.71e-03, grad_scale: 8.0 2023-06-27 15:25:28,088 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1837956.0, ans=0.125 2023-06-27 15:25:56,187 INFO [train.py:996] (0/4) Epoch 11, batch 1400, loss[loss=0.2793, simple_loss=0.3403, pruned_loss=0.1091, over 21649.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2954, pruned_loss=0.06923, over 4290220.70 frames. ], batch size: 507, lr: 2.71e-03, grad_scale: 8.0 2023-06-27 15:26:19,521 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.83 vs. limit=15.0 2023-06-27 15:26:29,589 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.23 vs. limit=12.0 2023-06-27 15:26:39,063 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1838196.0, ans=0.125 2023-06-27 15:26:46,613 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.209e+02 7.064e+02 1.087e+03 1.603e+03 3.118e+03, threshold=2.174e+03, percent-clipped=20.0 2023-06-27 15:27:30,238 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1838316.0, ans=0.125 2023-06-27 15:27:30,846 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.42 vs. 
limit=15.0 2023-06-27 15:27:39,798 INFO [train.py:996] (0/4) Epoch 11, batch 1450, loss[loss=0.1975, simple_loss=0.2881, pruned_loss=0.05344, over 21458.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2952, pruned_loss=0.0696, over 4291248.67 frames. ], batch size: 211, lr: 2.71e-03, grad_scale: 8.0 2023-06-27 15:27:55,395 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1838376.0, ans=0.125 2023-06-27 15:28:06,684 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1838436.0, ans=0.2 2023-06-27 15:28:57,977 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1838556.0, ans=0.125 2023-06-27 15:29:05,337 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.08 vs. limit=15.0 2023-06-27 15:29:21,707 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1838616.0, ans=0.1 2023-06-27 15:29:28,823 INFO [train.py:996] (0/4) Epoch 11, batch 1500, loss[loss=0.2009, simple_loss=0.3052, pruned_loss=0.04827, over 20941.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2973, pruned_loss=0.07086, over 4296224.86 frames. ], batch size: 608, lr: 2.71e-03, grad_scale: 8.0 2023-06-27 15:29:50,028 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1838736.0, ans=0.125 2023-06-27 15:30:14,622 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.898e+02 7.080e+02 9.690e+02 1.530e+03 3.266e+03, threshold=1.938e+03, percent-clipped=8.0 2023-06-27 15:30:16,040 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.81 vs. limit=15.0 2023-06-27 15:30:21,983 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1838796.0, ans=0.125 2023-06-27 15:30:22,129 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1838796.0, ans=0.0 2023-06-27 15:30:23,801 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1838856.0, ans=0.2 2023-06-27 15:31:14,246 INFO [train.py:996] (0/4) Epoch 11, batch 1550, loss[loss=0.2509, simple_loss=0.3127, pruned_loss=0.09451, over 21301.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2939, pruned_loss=0.06998, over 4301774.96 frames. ], batch size: 143, lr: 2.71e-03, grad_scale: 8.0 2023-06-27 15:31:20,260 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1838976.0, ans=0.1 2023-06-27 15:31:24,502 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.35 vs. limit=6.0 2023-06-27 15:31:34,602 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1839036.0, ans=0.125 2023-06-27 15:31:35,163 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.39 vs. 
limit=15.0 2023-06-27 15:33:01,769 INFO [train.py:996] (0/4) Epoch 11, batch 1600, loss[loss=0.2772, simple_loss=0.3435, pruned_loss=0.1055, over 21766.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2928, pruned_loss=0.06948, over 4289810.40 frames. ], batch size: 441, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:33:02,535 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1839276.0, ans=0.125 2023-06-27 15:33:06,650 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.90 vs. limit=15.0 2023-06-27 15:33:53,894 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.998e+02 6.555e+02 8.833e+02 1.502e+03 3.809e+03, threshold=1.767e+03, percent-clipped=10.0 2023-06-27 15:34:48,928 INFO [train.py:996] (0/4) Epoch 11, batch 1650, loss[loss=0.2348, simple_loss=0.3063, pruned_loss=0.08171, over 21813.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2903, pruned_loss=0.0679, over 4290444.10 frames. ], batch size: 118, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:35:21,167 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1839636.0, ans=0.125 2023-06-27 15:35:41,589 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1839696.0, ans=0.125 2023-06-27 15:36:32,271 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1839816.0, ans=0.05 2023-06-27 15:36:37,010 INFO [train.py:996] (0/4) Epoch 11, batch 1700, loss[loss=0.2341, simple_loss=0.311, pruned_loss=0.07855, over 21449.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2929, pruned_loss=0.06809, over 4294089.98 frames. ], batch size: 131, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:36:47,941 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.56 vs. limit=15.0 2023-06-27 15:37:07,004 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1839936.0, ans=0.5 2023-06-27 15:37:35,056 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.435e+02 5.947e+02 9.216e+02 1.351e+03 2.792e+03, threshold=1.843e+03, percent-clipped=11.0 2023-06-27 15:38:05,249 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1840056.0, ans=0.1 2023-06-27 15:38:30,373 INFO [train.py:996] (0/4) Epoch 11, batch 1750, loss[loss=0.1676, simple_loss=0.2635, pruned_loss=0.0359, over 21716.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.293, pruned_loss=0.06709, over 4291585.13 frames. ], batch size: 298, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:38:31,778 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.72 vs. 
limit=10.0 2023-06-27 15:38:59,007 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1840236.0, ans=0.125 2023-06-27 15:39:13,177 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1840236.0, ans=0.125 2023-06-27 15:39:26,573 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 15:39:41,445 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.07 vs. limit=15.0 2023-06-27 15:40:22,767 INFO [train.py:996] (0/4) Epoch 11, batch 1800, loss[loss=0.1629, simple_loss=0.2339, pruned_loss=0.04596, over 21280.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2914, pruned_loss=0.06553, over 4293983.40 frames. ], batch size: 131, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:40:30,491 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 15:40:57,770 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1840536.0, ans=0.125 2023-06-27 15:41:02,768 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1840536.0, ans=0.0 2023-06-27 15:41:08,006 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1840596.0, ans=0.125 2023-06-27 15:41:09,615 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1840596.0, ans=0.1 2023-06-27 15:41:13,924 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.102e+02 6.830e+02 1.090e+03 1.802e+03 4.605e+03, threshold=2.180e+03, percent-clipped=19.0 2023-06-27 15:41:14,693 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1840596.0, ans=0.125 2023-06-27 15:42:09,051 INFO [train.py:996] (0/4) Epoch 11, batch 1850, loss[loss=0.188, simple_loss=0.2709, pruned_loss=0.05257, over 19987.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2913, pruned_loss=0.06332, over 4289114.11 frames. ], batch size: 702, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:42:26,345 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1840776.0, ans=0.1 2023-06-27 15:43:53,668 INFO [train.py:996] (0/4) Epoch 11, batch 1900, loss[loss=0.1971, simple_loss=0.2605, pruned_loss=0.06686, over 21224.00 frames. ], tot_loss[loss=0.2085, simple_loss=0.2911, pruned_loss=0.06295, over 4284041.91 frames. 
], batch size: 608, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:44:20,990 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1841136.0, ans=0.125 2023-06-27 15:44:22,605 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1841136.0, ans=0.04949747468305833 2023-06-27 15:44:24,303 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1841136.0, ans=0.0 2023-06-27 15:44:30,713 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1841136.0, ans=0.0 2023-06-27 15:44:43,247 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.131e+02 8.434e+02 1.477e+03 2.094e+03 4.159e+03, threshold=2.954e+03, percent-clipped=22.0 2023-06-27 15:44:57,619 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1841256.0, ans=0.1 2023-06-27 15:45:41,635 INFO [train.py:996] (0/4) Epoch 11, batch 1950, loss[loss=0.2486, simple_loss=0.352, pruned_loss=0.07256, over 21192.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2905, pruned_loss=0.06395, over 4280773.69 frames. ], batch size: 548, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:45:42,302 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1841376.0, ans=0.125 2023-06-27 15:46:20,517 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1841496.0, ans=0.125 2023-06-27 15:47:26,635 INFO [train.py:996] (0/4) Epoch 11, batch 2000, loss[loss=0.1622, simple_loss=0.2368, pruned_loss=0.04384, over 21827.00 frames. ], tot_loss[loss=0.206, simple_loss=0.2859, pruned_loss=0.06305, over 4271010.76 frames. ], batch size: 102, lr: 2.71e-03, grad_scale: 32.0 2023-06-27 15:47:36,295 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.61 vs. limit=22.5 2023-06-27 15:47:52,434 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 15:48:13,536 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.255e+02 7.614e+02 1.079e+03 2.039e+03 3.848e+03, threshold=2.158e+03, percent-clipped=8.0 2023-06-27 15:48:48,785 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1841916.0, ans=0.0 2023-06-27 15:49:05,696 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.66 vs. limit=15.0 2023-06-27 15:49:09,577 INFO [train.py:996] (0/4) Epoch 11, batch 2050, loss[loss=0.2112, simple_loss=0.2877, pruned_loss=0.0673, over 21869.00 frames. ], tot_loss[loss=0.2054, simple_loss=0.2853, pruned_loss=0.06275, over 4274786.58 frames. ], batch size: 118, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:49:10,314 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1841976.0, ans=0.0 2023-06-27 15:49:27,308 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.70 vs. 
limit=10.0 2023-06-27 15:49:38,932 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1842036.0, ans=0.0 2023-06-27 15:50:33,431 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1842216.0, ans=0.125 2023-06-27 15:50:36,673 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1842216.0, ans=0.0 2023-06-27 15:50:58,302 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1842276.0, ans=0.125 2023-06-27 15:50:59,227 INFO [train.py:996] (0/4) Epoch 11, batch 2100, loss[loss=0.2327, simple_loss=0.3216, pruned_loss=0.07184, over 21896.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2875, pruned_loss=0.06401, over 4280237.09 frames. ], batch size: 371, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:51:12,561 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=10.04 vs. limit=15.0 2023-06-27 15:51:33,283 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1842396.0, ans=0.125 2023-06-27 15:51:46,351 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.397e+02 7.542e+02 1.130e+03 1.676e+03 4.140e+03, threshold=2.259e+03, percent-clipped=14.0 2023-06-27 15:52:44,203 INFO [train.py:996] (0/4) Epoch 11, batch 2150, loss[loss=0.2135, simple_loss=0.2933, pruned_loss=0.06684, over 15770.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2918, pruned_loss=0.06549, over 4264701.88 frames. ], batch size: 60, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:53:28,008 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.66 vs. limit=22.5 2023-06-27 15:54:29,213 INFO [train.py:996] (0/4) Epoch 11, batch 2200, loss[loss=0.2297, simple_loss=0.2974, pruned_loss=0.08097, over 21432.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2922, pruned_loss=0.06597, over 4262181.32 frames. ], batch size: 471, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:54:35,071 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1842876.0, ans=0.0 2023-06-27 15:55:16,679 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.261e+02 6.339e+02 9.896e+02 1.686e+03 3.946e+03, threshold=1.979e+03, percent-clipped=15.0 2023-06-27 15:56:14,356 INFO [train.py:996] (0/4) Epoch 11, batch 2250, loss[loss=0.2072, simple_loss=0.2836, pruned_loss=0.06542, over 21799.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2919, pruned_loss=0.06464, over 4259151.68 frames. ], batch size: 351, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:56:51,833 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1843296.0, ans=0.0 2023-06-27 15:57:28,532 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1843356.0, ans=0.0 2023-06-27 15:57:46,272 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1843416.0, ans=0.0 2023-06-27 15:57:52,261 INFO [train.py:996] (0/4) Epoch 11, batch 2300, loss[loss=0.1833, simple_loss=0.2568, pruned_loss=0.05489, over 21674.00 frames. 
], tot_loss[loss=0.2071, simple_loss=0.2871, pruned_loss=0.06349, over 4252088.16 frames. ], batch size: 316, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:58:39,377 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.760e+02 6.436e+02 1.038e+03 1.737e+03 5.031e+03, threshold=2.076e+03, percent-clipped=15.0 2023-06-27 15:59:06,642 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1843656.0, ans=0.125 2023-06-27 15:59:23,152 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1843716.0, ans=0.125 2023-06-27 15:59:36,622 INFO [train.py:996] (0/4) Epoch 11, batch 2350, loss[loss=0.1973, simple_loss=0.2733, pruned_loss=0.06064, over 21954.00 frames. ], tot_loss[loss=0.2041, simple_loss=0.2813, pruned_loss=0.0635, over 4252868.90 frames. ], batch size: 103, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:59:43,865 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1843776.0, ans=0.0 2023-06-27 15:59:57,610 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1843836.0, ans=0.125 2023-06-27 16:00:48,975 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1843956.0, ans=0.125 2023-06-27 16:01:21,956 INFO [train.py:996] (0/4) Epoch 11, batch 2400, loss[loss=0.2479, simple_loss=0.3562, pruned_loss=0.06983, over 16829.00 frames. ], tot_loss[loss=0.2061, simple_loss=0.2828, pruned_loss=0.06472, over 4258926.42 frames. ], batch size: 60, lr: 2.71e-03, grad_scale: 32.0 2023-06-27 16:01:36,846 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.40 vs. limit=6.0 2023-06-27 16:01:39,441 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1844136.0, ans=0.125 2023-06-27 16:02:12,449 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1844196.0, ans=0.1 2023-06-27 16:02:21,960 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.317e+02 6.915e+02 1.084e+03 1.714e+03 3.712e+03, threshold=2.167e+03, percent-clipped=11.0 2023-06-27 16:02:55,059 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.42 vs. limit=22.5 2023-06-27 16:02:59,786 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=1844316.0, ans=15.0 2023-06-27 16:03:07,404 INFO [train.py:996] (0/4) Epoch 11, batch 2450, loss[loss=0.2002, simple_loss=0.2649, pruned_loss=0.06771, over 21547.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2869, pruned_loss=0.06756, over 4264358.33 frames. ], batch size: 263, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 16:03:13,517 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1844376.0, ans=0.1 2023-06-27 16:03:18,843 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.44 vs. 
limit=10.0 2023-06-27 16:03:57,848 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1844496.0, ans=0.125 2023-06-27 16:04:11,036 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1844556.0, ans=0.2 2023-06-27 16:04:50,006 INFO [train.py:996] (0/4) Epoch 11, batch 2500, loss[loss=0.1929, simple_loss=0.276, pruned_loss=0.0549, over 21783.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2862, pruned_loss=0.06781, over 4274628.29 frames. ], batch size: 112, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 16:04:52,258 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff3.min_abs, batch_count=1844676.0, ans=0.2 2023-06-27 16:04:58,746 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1844676.0, ans=0.09899494936611666 2023-06-27 16:05:05,554 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1844736.0, ans=0.125 2023-06-27 16:05:10,713 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1844736.0, ans=0.0 2023-06-27 16:05:19,297 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1844736.0, ans=0.2 2023-06-27 16:05:42,662 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1844796.0, ans=0.015 2023-06-27 16:05:43,708 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.461e+02 7.979e+02 1.093e+03 1.704e+03 3.202e+03, threshold=2.185e+03, percent-clipped=12.0 2023-06-27 16:06:15,809 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1844916.0, ans=0.125 2023-06-27 16:06:18,328 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.65 vs. limit=15.0 2023-06-27 16:06:34,033 INFO [train.py:996] (0/4) Epoch 11, batch 2550, loss[loss=0.2563, simple_loss=0.3728, pruned_loss=0.06995, over 19702.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2863, pruned_loss=0.06681, over 4268211.58 frames. ], batch size: 702, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 16:06:53,513 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.23 vs. limit=10.0 2023-06-27 16:07:10,129 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1845036.0, ans=0.1 2023-06-27 16:07:43,469 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_na.min_abs, batch_count=1845156.0, ans=0.02 2023-06-27 16:08:18,052 INFO [train.py:996] (0/4) Epoch 11, batch 2600, loss[loss=0.2452, simple_loss=0.3244, pruned_loss=0.08299, over 21486.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2872, pruned_loss=0.06783, over 4269504.23 frames. 
], batch size: 131, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 16:09:12,265 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.122e+02 7.338e+02 1.284e+03 1.915e+03 4.312e+03, threshold=2.567e+03, percent-clipped=18.0 2023-06-27 16:09:58,132 INFO [train.py:996] (0/4) Epoch 11, batch 2650, loss[loss=0.2572, simple_loss=0.3181, pruned_loss=0.09809, over 21614.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2888, pruned_loss=0.06945, over 4274380.79 frames. ], batch size: 471, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 16:09:58,890 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1845576.0, ans=0.125 2023-06-27 16:11:04,641 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.34 vs. limit=6.0 2023-06-27 16:11:43,781 INFO [train.py:996] (0/4) Epoch 11, batch 2700, loss[loss=0.1916, simple_loss=0.2725, pruned_loss=0.05532, over 21787.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2869, pruned_loss=0.0692, over 4280752.51 frames. ], batch size: 316, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 16:11:47,823 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1845876.0, ans=0.0 2023-06-27 16:12:43,565 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.729e+02 6.625e+02 9.246e+02 1.409e+03 2.648e+03, threshold=1.849e+03, percent-clipped=2.0 2023-06-27 16:13:28,866 INFO [train.py:996] (0/4) Epoch 11, batch 2750, loss[loss=0.219, simple_loss=0.3001, pruned_loss=0.06889, over 21849.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2871, pruned_loss=0.06923, over 4281230.82 frames. ], batch size: 112, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 16:13:34,780 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 16:14:28,401 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.14 vs. limit=15.0 2023-06-27 16:15:04,541 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1846416.0, ans=0.125 2023-06-27 16:15:08,601 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.35 vs. limit=15.0 2023-06-27 16:15:15,728 INFO [train.py:996] (0/4) Epoch 11, batch 2800, loss[loss=0.2276, simple_loss=0.3526, pruned_loss=0.05129, over 19725.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2919, pruned_loss=0.07031, over 4278463.10 frames. ], batch size: 703, lr: 2.71e-03, grad_scale: 32.0 2023-06-27 16:15:30,422 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1846476.0, ans=0.0 2023-06-27 16:15:30,902 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.88 vs. limit=15.0 2023-06-27 16:15:46,827 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.77 vs. limit=15.0 2023-06-27 16:15:48,603 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.55 vs. 
limit=22.5 2023-06-27 16:15:51,483 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1846536.0, ans=0.125 2023-06-27 16:15:51,554 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1846536.0, ans=0.125 2023-06-27 16:16:18,332 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.904e+02 7.981e+02 1.210e+03 1.745e+03 3.756e+03, threshold=2.419e+03, percent-clipped=24.0 2023-06-27 16:16:41,488 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1846716.0, ans=0.0 2023-06-27 16:16:46,563 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1846716.0, ans=0.0 2023-06-27 16:17:03,367 INFO [train.py:996] (0/4) Epoch 11, batch 2850, loss[loss=0.1996, simple_loss=0.281, pruned_loss=0.05906, over 21852.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2922, pruned_loss=0.07093, over 4282365.28 frames. ], batch size: 372, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 16:18:03,641 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1846896.0, ans=0.04949747468305833 2023-06-27 16:18:05,266 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1846896.0, ans=0.07 2023-06-27 16:18:06,981 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1846956.0, ans=0.2 2023-06-27 16:18:40,554 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1847076.0, ans=0.2 2023-06-27 16:18:41,467 INFO [train.py:996] (0/4) Epoch 11, batch 2900, loss[loss=0.2302, simple_loss=0.3103, pruned_loss=0.07506, over 21895.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2926, pruned_loss=0.07087, over 4289208.71 frames. ], batch size: 124, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 16:18:53,196 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1847076.0, ans=0.125 2023-06-27 16:18:58,856 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.20 vs. limit=15.0 2023-06-27 16:19:24,810 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1847136.0, ans=0.125 2023-06-27 16:19:45,512 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.568e+02 6.840e+02 9.553e+02 1.645e+03 3.808e+03, threshold=1.911e+03, percent-clipped=8.0 2023-06-27 16:20:05,324 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1847256.0, ans=0.0 2023-06-27 16:20:20,675 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1847316.0, ans=0.0 2023-06-27 16:20:25,215 INFO [train.py:996] (0/4) Epoch 11, batch 2950, loss[loss=0.2241, simple_loss=0.3027, pruned_loss=0.07276, over 21148.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2944, pruned_loss=0.07151, over 4292587.91 frames. 
], batch size: 143, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 16:20:34,188 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1847376.0, ans=0.1 2023-06-27 16:20:37,648 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1847376.0, ans=0.0 2023-06-27 16:20:42,780 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1847376.0, ans=0.0 2023-06-27 16:21:09,630 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.50 vs. limit=22.5 2023-06-27 16:22:14,912 INFO [train.py:996] (0/4) Epoch 11, batch 3000, loss[loss=0.2301, simple_loss=0.3165, pruned_loss=0.0718, over 21505.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2976, pruned_loss=0.07049, over 4293188.98 frames. ], batch size: 131, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 16:22:14,914 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-27 16:22:35,495 INFO [train.py:1028] (0/4) Epoch 11, validation: loss=0.2528, simple_loss=0.3433, pruned_loss=0.08109, over 1796401.00 frames. 2023-06-27 16:22:35,496 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23714MB 2023-06-27 16:23:27,444 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.016e+02 6.559e+02 9.881e+02 1.581e+03 3.511e+03, threshold=1.976e+03, percent-clipped=15.0 2023-06-27 16:23:43,345 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1847856.0, ans=0.125 2023-06-27 16:24:00,358 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1847916.0, ans=0.0 2023-06-27 16:24:11,907 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1847916.0, ans=10.0 2023-06-27 16:24:16,758 INFO [train.py:996] (0/4) Epoch 11, batch 3050, loss[loss=0.1761, simple_loss=0.2627, pruned_loss=0.0447, over 21770.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2977, pruned_loss=0.06867, over 4290762.74 frames. ], batch size: 247, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 16:24:19,771 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.27 vs. limit=15.0 2023-06-27 16:24:22,081 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-308000.pt 2023-06-27 16:24:51,614 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1848036.0, ans=0.1 2023-06-27 16:24:53,364 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1848036.0, ans=0.07 2023-06-27 16:26:03,796 INFO [train.py:996] (0/4) Epoch 11, batch 3100, loss[loss=0.2119, simple_loss=0.3003, pruned_loss=0.06181, over 21809.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2982, pruned_loss=0.06818, over 4295198.45 frames. 
], batch size: 298, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 16:26:54,554 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.218e+02 9.868e+02 1.604e+03 2.316e+03 3.970e+03, threshold=3.207e+03, percent-clipped=39.0 2023-06-27 16:27:03,629 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1848456.0, ans=0.125 2023-06-27 16:27:47,946 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1848516.0, ans=0.125 2023-06-27 16:27:54,283 INFO [train.py:996] (0/4) Epoch 11, batch 3150, loss[loss=0.2322, simple_loss=0.3114, pruned_loss=0.07648, over 21757.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2991, pruned_loss=0.06827, over 4299177.50 frames. ], batch size: 332, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 16:28:10,611 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1848636.0, ans=0.2 2023-06-27 16:28:19,663 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.65 vs. limit=22.5 2023-06-27 16:28:24,143 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1848636.0, ans=0.0 2023-06-27 16:28:27,754 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1848696.0, ans=0.1 2023-06-27 16:29:23,898 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1848816.0, ans=0.125 2023-06-27 16:29:40,791 INFO [train.py:996] (0/4) Epoch 11, batch 3200, loss[loss=0.195, simple_loss=0.2829, pruned_loss=0.05358, over 21453.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.3012, pruned_loss=0.06876, over 4295050.92 frames. ], batch size: 194, lr: 2.70e-03, grad_scale: 32.0 2023-06-27 16:30:32,470 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.25 vs. limit=15.0 2023-06-27 16:30:42,999 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.812e+02 8.312e+02 1.188e+03 1.817e+03 3.495e+03, threshold=2.376e+03, percent-clipped=3.0 2023-06-27 16:30:57,146 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1849056.0, ans=0.125 2023-06-27 16:31:25,351 INFO [train.py:996] (0/4) Epoch 11, batch 3250, loss[loss=0.2291, simple_loss=0.3058, pruned_loss=0.07621, over 21622.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.3028, pruned_loss=0.07013, over 4290876.62 frames. ], batch size: 263, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 16:31:27,710 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1849176.0, ans=0.1 2023-06-27 16:32:06,888 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=1849236.0, ans=0.5 2023-06-27 16:32:50,928 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1849356.0, ans=0.125 2023-06-27 16:33:11,137 INFO [train.py:996] (0/4) Epoch 11, batch 3300, loss[loss=0.2574, simple_loss=0.3207, pruned_loss=0.09707, over 21378.00 frames. 
], tot_loss[loss=0.2183, simple_loss=0.2976, pruned_loss=0.06948, over 4282493.24 frames. ], batch size: 507, lr: 2.70e-03, grad_scale: 8.0 2023-06-27 16:33:44,245 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1849536.0, ans=0.125 2023-06-27 16:33:55,858 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1849596.0, ans=0.125 2023-06-27 16:34:01,035 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1849596.0, ans=0.1 2023-06-27 16:34:06,231 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1849596.0, ans=0.125 2023-06-27 16:34:15,386 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.316e+02 6.747e+02 1.095e+03 2.044e+03 4.676e+03, threshold=2.190e+03, percent-clipped=15.0 2023-06-27 16:34:50,726 INFO [train.py:996] (0/4) Epoch 11, batch 3350, loss[loss=0.2661, simple_loss=0.3469, pruned_loss=0.09264, over 21424.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2999, pruned_loss=0.06959, over 4279828.41 frames. ], batch size: 471, lr: 2.70e-03, grad_scale: 8.0 2023-06-27 16:35:14,404 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.87 vs. limit=12.0 2023-06-27 16:36:21,379 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 16:36:35,656 INFO [train.py:996] (0/4) Epoch 11, batch 3400, loss[loss=0.2125, simple_loss=0.2907, pruned_loss=0.06714, over 21862.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2995, pruned_loss=0.07004, over 4281147.33 frames. ], batch size: 372, lr: 2.70e-03, grad_scale: 8.0 2023-06-27 16:37:43,320 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.448e+02 6.438e+02 9.614e+02 1.434e+03 2.571e+03, threshold=1.923e+03, percent-clipped=1.0 2023-06-27 16:38:11,705 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 16:38:24,830 INFO [train.py:996] (0/4) Epoch 11, batch 3450, loss[loss=0.2312, simple_loss=0.2759, pruned_loss=0.0933, over 21488.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2957, pruned_loss=0.06977, over 4281562.66 frames. ], batch size: 510, lr: 2.70e-03, grad_scale: 8.0 2023-06-27 16:40:15,602 INFO [train.py:996] (0/4) Epoch 11, batch 3500, loss[loss=0.1985, simple_loss=0.279, pruned_loss=0.05896, over 21798.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.3021, pruned_loss=0.0725, over 4278858.88 frames. ], batch size: 107, lr: 2.70e-03, grad_scale: 8.0 2023-06-27 16:41:11,313 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1850796.0, ans=0.0 2023-06-27 16:41:14,144 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.683e+02 8.214e+02 1.340e+03 2.218e+03 5.014e+03, threshold=2.681e+03, percent-clipped=29.0 2023-06-27 16:41:26,647 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1850856.0, ans=0.1 2023-06-27 16:42:05,011 INFO [train.py:996] (0/4) Epoch 11, batch 3550, loss[loss=0.2129, simple_loss=0.2814, pruned_loss=0.07214, over 21450.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3032, pruned_loss=0.07281, over 4281091.46 frames. 
], batch size: 389, lr: 2.70e-03, grad_scale: 8.0 2023-06-27 16:42:22,975 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.54 vs. limit=10.0 2023-06-27 16:42:32,454 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1851036.0, ans=0.125 2023-06-27 16:42:34,076 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1851036.0, ans=0.0 2023-06-27 16:43:08,483 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.75 vs. limit=10.0 2023-06-27 16:43:09,674 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1851156.0, ans=0.125 2023-06-27 16:43:49,657 INFO [train.py:996] (0/4) Epoch 11, batch 3600, loss[loss=0.1747, simple_loss=0.2427, pruned_loss=0.05335, over 21672.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.2977, pruned_loss=0.07236, over 4283689.75 frames. ], batch size: 282, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 16:44:10,858 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1851336.0, ans=0.125 2023-06-27 16:44:26,717 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1851336.0, ans=0.125 2023-06-27 16:44:44,958 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.210e+02 6.431e+02 1.048e+03 1.688e+03 3.904e+03, threshold=2.095e+03, percent-clipped=4.0 2023-06-27 16:44:47,487 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1851456.0, ans=0.0 2023-06-27 16:44:48,885 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1851456.0, ans=0.2 2023-06-27 16:45:36,121 INFO [train.py:996] (0/4) Epoch 11, batch 3650, loss[loss=0.2124, simple_loss=0.302, pruned_loss=0.06142, over 20803.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2981, pruned_loss=0.07256, over 4279403.91 frames. ], batch size: 608, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 16:47:00,003 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1851816.0, ans=0.2 2023-06-27 16:47:13,637 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1851816.0, ans=0.1 2023-06-27 16:47:19,916 INFO [train.py:996] (0/4) Epoch 11, batch 3700, loss[loss=0.21, simple_loss=0.2921, pruned_loss=0.06401, over 21856.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2962, pruned_loss=0.07172, over 4277543.62 frames. ], batch size: 298, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 16:48:13,920 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.881e+02 6.744e+02 1.016e+03 1.702e+03 3.129e+03, threshold=2.032e+03, percent-clipped=14.0 2023-06-27 16:48:19,684 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1852056.0, ans=0.0 2023-06-27 16:49:04,987 INFO [train.py:996] (0/4) Epoch 11, batch 3750, loss[loss=0.1811, simple_loss=0.2548, pruned_loss=0.05374, over 21636.00 frames. 
], tot_loss[loss=0.2186, simple_loss=0.2946, pruned_loss=0.07128, over 4287163.55 frames. ], batch size: 230, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 16:49:33,422 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.44 vs. limit=15.0 2023-06-27 16:50:05,759 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.44 vs. limit=22.5 2023-06-27 16:50:43,555 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1852416.0, ans=0.125 2023-06-27 16:50:49,324 INFO [train.py:996] (0/4) Epoch 11, batch 3800, loss[loss=0.241, simple_loss=0.3195, pruned_loss=0.08122, over 21816.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2914, pruned_loss=0.0693, over 4289971.86 frames. ], batch size: 441, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 16:51:08,654 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1852536.0, ans=0.0 2023-06-27 16:51:11,871 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1852536.0, ans=0.125 2023-06-27 16:51:19,987 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1852536.0, ans=0.1 2023-06-27 16:51:38,523 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1852596.0, ans=0.125 2023-06-27 16:51:47,704 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.392e+02 7.087e+02 9.540e+02 1.301e+03 2.936e+03, threshold=1.908e+03, percent-clipped=6.0 2023-06-27 16:51:57,899 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1852656.0, ans=0.0 2023-06-27 16:52:10,383 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.25 vs. limit=15.0 2023-06-27 16:52:32,397 INFO [train.py:996] (0/4) Epoch 11, batch 3850, loss[loss=0.1892, simple_loss=0.2722, pruned_loss=0.05313, over 20103.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2896, pruned_loss=0.06972, over 4278084.35 frames. ], batch size: 703, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 16:52:39,927 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1852776.0, ans=0.2 2023-06-27 16:53:47,336 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1852956.0, ans=0.2 2023-06-27 16:53:52,395 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1853016.0, ans=0.125 2023-06-27 16:54:14,370 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.48 vs. limit=15.0 2023-06-27 16:54:14,660 INFO [train.py:996] (0/4) Epoch 11, batch 3900, loss[loss=0.2012, simple_loss=0.2666, pruned_loss=0.06797, over 21283.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2846, pruned_loss=0.06921, over 4279306.26 frames. 
], batch size: 176, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 16:54:18,497 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1853076.0, ans=0.0 2023-06-27 16:54:18,521 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1853076.0, ans=0.0 2023-06-27 16:54:19,109 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.14 vs. limit=15.0 2023-06-27 16:54:32,171 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1853136.0, ans=0.0 2023-06-27 16:54:49,837 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.13 vs. limit=10.0 2023-06-27 16:55:09,164 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.347e+02 6.134e+02 8.883e+02 1.369e+03 3.236e+03, threshold=1.777e+03, percent-clipped=7.0 2023-06-27 16:55:09,795 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1853256.0, ans=0.0 2023-06-27 16:55:22,089 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.72 vs. limit=15.0 2023-06-27 16:55:24,994 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1853256.0, ans=0.125 2023-06-27 16:55:54,568 INFO [train.py:996] (0/4) Epoch 11, batch 3950, loss[loss=0.1795, simple_loss=0.2778, pruned_loss=0.04055, over 21679.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2881, pruned_loss=0.06864, over 4278678.94 frames. ], batch size: 414, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 16:56:41,826 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1853496.0, ans=0.1 2023-06-27 16:57:18,410 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1853616.0, ans=0.0 2023-06-27 16:57:31,923 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1853676.0, ans=0.2 2023-06-27 16:57:32,922 INFO [train.py:996] (0/4) Epoch 11, batch 4000, loss[loss=0.1902, simple_loss=0.2531, pruned_loss=0.06364, over 21199.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.2824, pruned_loss=0.06544, over 4273880.04 frames. ], batch size: 144, lr: 2.70e-03, grad_scale: 32.0 2023-06-27 16:58:22,738 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1853796.0, ans=0.125 2023-06-27 16:58:37,195 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.700e+02 6.762e+02 1.217e+03 2.027e+03 5.671e+03, threshold=2.434e+03, percent-clipped=30.0 2023-06-27 16:58:52,705 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1853856.0, ans=0.125 2023-06-27 16:59:17,778 INFO [train.py:996] (0/4) Epoch 11, batch 4050, loss[loss=0.2013, simple_loss=0.3042, pruned_loss=0.04924, over 21620.00 frames. ], tot_loss[loss=0.2058, simple_loss=0.2831, pruned_loss=0.06424, over 4270177.77 frames. 
], batch size: 263, lr: 2.70e-03, grad_scale: 32.0 2023-06-27 16:59:22,303 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.24 vs. limit=15.0 2023-06-27 16:59:32,355 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.59 vs. limit=10.0 2023-06-27 17:00:03,397 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1854096.0, ans=0.125 2023-06-27 17:00:25,476 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1854156.0, ans=0.125 2023-06-27 17:00:36,633 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1854156.0, ans=0.125 2023-06-27 17:00:53,345 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1854216.0, ans=0.125 2023-06-27 17:01:01,319 INFO [train.py:996] (0/4) Epoch 11, batch 4100, loss[loss=0.231, simple_loss=0.297, pruned_loss=0.08256, over 21828.00 frames. ], tot_loss[loss=0.208, simple_loss=0.2846, pruned_loss=0.0657, over 4274785.96 frames. ], batch size: 441, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:01:05,866 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.07 vs. limit=15.0 2023-06-27 17:01:15,992 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1854276.0, ans=0.125 2023-06-27 17:01:44,090 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1854396.0, ans=0.025 2023-06-27 17:01:59,209 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.54 vs. limit=22.5 2023-06-27 17:02:00,492 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1854396.0, ans=0.2 2023-06-27 17:02:11,185 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.335e+02 7.301e+02 1.093e+03 1.524e+03 3.311e+03, threshold=2.186e+03, percent-clipped=4.0 2023-06-27 17:02:45,152 INFO [train.py:996] (0/4) Epoch 11, batch 4150, loss[loss=0.2113, simple_loss=0.2801, pruned_loss=0.07123, over 20004.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.286, pruned_loss=0.06366, over 4271293.36 frames. ], batch size: 703, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:03:26,744 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1854636.0, ans=0.125 2023-06-27 17:03:28,527 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1854696.0, ans=0.0 2023-06-27 17:03:59,939 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1854756.0, ans=0.0 2023-06-27 17:04:26,506 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1854876.0, ans=0.125 2023-06-27 17:04:27,386 INFO [train.py:996] (0/4) Epoch 11, batch 4200, loss[loss=0.1851, simple_loss=0.2849, pruned_loss=0.04264, over 19800.00 frames. 
], tot_loss[loss=0.2059, simple_loss=0.2852, pruned_loss=0.06337, over 4268402.99 frames. ], batch size: 703, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:04:40,499 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1854876.0, ans=0.07 2023-06-27 17:04:42,170 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1854876.0, ans=0.2 2023-06-27 17:05:34,646 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.415e+02 6.010e+02 8.416e+02 1.376e+03 4.083e+03, threshold=1.683e+03, percent-clipped=10.0 2023-06-27 17:06:14,236 INFO [train.py:996] (0/4) Epoch 11, batch 4250, loss[loss=0.2622, simple_loss=0.3439, pruned_loss=0.09024, over 21763.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2903, pruned_loss=0.06467, over 4266772.59 frames. ], batch size: 118, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:06:21,712 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1855176.0, ans=0.025 2023-06-27 17:06:50,308 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1855236.0, ans=0.1 2023-06-27 17:07:16,839 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1855356.0, ans=0.0 2023-06-27 17:07:49,531 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1855416.0, ans=0.1 2023-06-27 17:07:55,966 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1855416.0, ans=0.1 2023-06-27 17:08:00,642 INFO [train.py:996] (0/4) Epoch 11, batch 4300, loss[loss=0.2292, simple_loss=0.3198, pruned_loss=0.06932, over 21651.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2952, pruned_loss=0.06638, over 4266704.14 frames. ], batch size: 263, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:08:01,244 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1855476.0, ans=0.0 2023-06-27 17:08:23,354 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1855536.0, ans=0.1 2023-06-27 17:08:39,842 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1855596.0, ans=0.125 2023-06-27 17:08:40,450 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.67 vs. limit=15.0 2023-06-27 17:08:55,702 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.494e+02 7.202e+02 1.029e+03 1.570e+03 4.728e+03, threshold=2.058e+03, percent-clipped=18.0 2023-06-27 17:09:39,120 INFO [train.py:996] (0/4) Epoch 11, batch 4350, loss[loss=0.2403, simple_loss=0.3456, pruned_loss=0.06751, over 21659.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2951, pruned_loss=0.06588, over 4269626.42 frames. 
], batch size: 414, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:09:49,608 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1855776.0, ans=0.2 2023-06-27 17:10:10,158 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1855836.0, ans=0.125 2023-06-27 17:10:16,242 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.98 vs. limit=10.0 2023-06-27 17:10:53,420 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.53 vs. limit=15.0 2023-06-27 17:10:53,634 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.28 vs. limit=15.0 2023-06-27 17:11:19,454 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.02 vs. limit=22.5 2023-06-27 17:11:29,248 INFO [train.py:996] (0/4) Epoch 11, batch 4400, loss[loss=0.2012, simple_loss=0.2655, pruned_loss=0.06848, over 21201.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2902, pruned_loss=0.06604, over 4267029.01 frames. ], batch size: 608, lr: 2.70e-03, grad_scale: 32.0 2023-06-27 17:11:33,728 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_na.min_abs, batch_count=1856076.0, ans=0.02 2023-06-27 17:11:37,239 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1856076.0, ans=0.025 2023-06-27 17:11:47,566 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1856136.0, ans=0.125 2023-06-27 17:12:16,896 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1856196.0, ans=0.2 2023-06-27 17:12:32,725 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.456e+02 7.940e+02 1.162e+03 1.682e+03 5.044e+03, threshold=2.325e+03, percent-clipped=15.0 2023-06-27 17:13:05,762 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.11 vs. limit=22.5 2023-06-27 17:13:14,919 INFO [train.py:996] (0/4) Epoch 11, batch 4450, loss[loss=0.2566, simple_loss=0.3521, pruned_loss=0.08053, over 21678.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2981, pruned_loss=0.06737, over 4268502.03 frames. 
], batch size: 389, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:13:37,642 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1856436.0, ans=0.0 2023-06-27 17:14:18,273 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1856496.0, ans=0.125 2023-06-27 17:14:30,170 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=1856556.0, ans=0.5 2023-06-27 17:14:51,741 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1856616.0, ans=0.0 2023-06-27 17:14:59,763 INFO [train.py:996] (0/4) Epoch 11, batch 4500, loss[loss=0.2018, simple_loss=0.2826, pruned_loss=0.0605, over 21764.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.3011, pruned_loss=0.06865, over 4275923.95 frames. ], batch size: 247, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:15:17,477 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.62 vs. limit=10.0 2023-06-27 17:15:23,749 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1856736.0, ans=0.0 2023-06-27 17:15:35,635 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1856736.0, ans=0.125 2023-06-27 17:15:58,284 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1856856.0, ans=0.125 2023-06-27 17:16:01,160 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.839e+02 8.296e+02 1.426e+03 1.842e+03 5.527e+03, threshold=2.851e+03, percent-clipped=18.0 2023-06-27 17:16:23,569 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1856916.0, ans=0.0 2023-06-27 17:16:38,265 INFO [train.py:996] (0/4) Epoch 11, batch 4550, loss[loss=0.2495, simple_loss=0.3318, pruned_loss=0.08362, over 21216.00 frames. ], tot_loss[loss=0.221, simple_loss=0.3041, pruned_loss=0.06892, over 4280906.42 frames. ], batch size: 159, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:17:24,808 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1857096.0, ans=0.125 2023-06-27 17:18:09,181 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1857216.0, ans=0.1 2023-06-27 17:18:21,934 INFO [train.py:996] (0/4) Epoch 11, batch 4600, loss[loss=0.2115, simple_loss=0.2922, pruned_loss=0.06538, over 21868.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.3066, pruned_loss=0.07088, over 4281861.65 frames. ], batch size: 371, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:18:26,503 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.64 vs. limit=6.0 2023-06-27 17:18:29,669 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.17 vs. 
limit=15.0 2023-06-27 17:19:03,075 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1857336.0, ans=0.1 2023-06-27 17:19:26,231 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1857456.0, ans=0.05 2023-06-27 17:19:33,362 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.579e+02 7.640e+02 1.105e+03 1.523e+03 3.294e+03, threshold=2.209e+03, percent-clipped=1.0 2023-06-27 17:19:56,844 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.23 vs. limit=15.0 2023-06-27 17:20:05,570 INFO [train.py:996] (0/4) Epoch 11, batch 4650, loss[loss=0.1805, simple_loss=0.2587, pruned_loss=0.05114, over 21810.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.3007, pruned_loss=0.06916, over 4281913.60 frames. ], batch size: 351, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:21:14,244 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1857756.0, ans=0.0 2023-06-27 17:21:49,625 INFO [train.py:996] (0/4) Epoch 11, batch 4700, loss[loss=0.1788, simple_loss=0.2517, pruned_loss=0.05294, over 21769.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2926, pruned_loss=0.06748, over 4282958.15 frames. ], batch size: 351, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:22:05,190 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1857876.0, ans=0.1 2023-06-27 17:22:49,873 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.57 vs. limit=22.5 2023-06-27 17:22:59,953 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.795e+02 6.918e+02 1.097e+03 1.707e+03 4.002e+03, threshold=2.193e+03, percent-clipped=11.0 2023-06-27 17:23:08,600 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1858056.0, ans=0.125 2023-06-27 17:23:22,000 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1858116.0, ans=0.125 2023-06-27 17:23:31,322 INFO [train.py:996] (0/4) Epoch 11, batch 4750, loss[loss=0.2635, simple_loss=0.3148, pruned_loss=0.1061, over 21499.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2904, pruned_loss=0.06781, over 4277096.32 frames. ], batch size: 471, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:24:09,415 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1858236.0, ans=0.035 2023-06-27 17:24:11,867 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.55 vs. limit=10.0 2023-06-27 17:24:27,812 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1858296.0, ans=0.125 2023-06-27 17:24:28,330 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.34 vs. limit=22.5 2023-06-27 17:25:20,793 INFO [train.py:996] (0/4) Epoch 11, batch 4800, loss[loss=0.2009, simple_loss=0.2785, pruned_loss=0.06163, over 21843.00 frames. 
], tot_loss[loss=0.2136, simple_loss=0.2906, pruned_loss=0.06832, over 4284879.63 frames. ], batch size: 282, lr: 2.70e-03, grad_scale: 32.0 2023-06-27 17:25:55,495 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.19 vs. limit=22.5 2023-06-27 17:26:28,649 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.577e+02 8.092e+02 1.102e+03 1.736e+03 3.587e+03, threshold=2.204e+03, percent-clipped=14.0 2023-06-27 17:26:34,554 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1858656.0, ans=0.125 2023-06-27 17:26:41,161 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1858716.0, ans=0.125 2023-06-27 17:26:54,074 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1858716.0, ans=0.0 2023-06-27 17:27:03,217 INFO [train.py:996] (0/4) Epoch 11, batch 4850, loss[loss=0.1859, simple_loss=0.2577, pruned_loss=0.0571, over 21724.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.2886, pruned_loss=0.06795, over 4282644.78 frames. ], batch size: 247, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:27:03,806 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1858776.0, ans=0.0 2023-06-27 17:27:07,169 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1858776.0, ans=0.125 2023-06-27 17:27:14,122 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1858776.0, ans=0.125 2023-06-27 17:28:01,840 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1858896.0, ans=0.125 2023-06-27 17:28:03,496 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1858896.0, ans=0.125 2023-06-27 17:28:41,958 INFO [train.py:996] (0/4) Epoch 11, batch 4900, loss[loss=0.2236, simple_loss=0.2968, pruned_loss=0.07519, over 21447.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.2887, pruned_loss=0.06792, over 4283085.43 frames. 
], batch size: 211, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:28:50,428 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1859076.0, ans=0.125 2023-06-27 17:28:53,693 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1859076.0, ans=0.1 2023-06-27 17:28:53,741 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1859076.0, ans=0.5 2023-06-27 17:29:03,967 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1859076.0, ans=0.1 2023-06-27 17:29:56,066 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.319e+02 7.657e+02 1.361e+03 1.915e+03 3.497e+03, threshold=2.723e+03, percent-clipped=17.0 2023-06-27 17:29:56,640 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1859256.0, ans=0.0 2023-06-27 17:30:24,994 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1859316.0, ans=0.125 2023-06-27 17:30:31,180 INFO [train.py:996] (0/4) Epoch 11, batch 4950, loss[loss=0.1769, simple_loss=0.2707, pruned_loss=0.04152, over 21407.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2924, pruned_loss=0.06676, over 4279984.39 frames. ], batch size: 211, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:31:11,968 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1859436.0, ans=0.1 2023-06-27 17:31:23,203 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1859496.0, ans=0.1 2023-06-27 17:31:31,337 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1859496.0, ans=0.0 2023-06-27 17:31:52,939 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1859616.0, ans=0.0 2023-06-27 17:32:14,077 INFO [train.py:996] (0/4) Epoch 11, batch 5000, loss[loss=0.2503, simple_loss=0.3201, pruned_loss=0.09023, over 21799.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2913, pruned_loss=0.06418, over 4280689.57 frames. ], batch size: 112, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:32:37,563 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1859676.0, ans=0.125 2023-06-27 17:32:38,300 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.23 vs. limit=15.0 2023-06-27 17:32:58,566 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 17:33:20,249 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.265e+02 5.938e+02 8.345e+02 1.344e+03 2.733e+03, threshold=1.669e+03, percent-clipped=1.0 2023-06-27 17:33:21,576 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.60 vs. 
limit=15.0 2023-06-27 17:33:45,892 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1859916.0, ans=0.2 2023-06-27 17:33:50,176 INFO [train.py:996] (0/4) Epoch 11, batch 5050, loss[loss=0.1941, simple_loss=0.2715, pruned_loss=0.05833, over 21637.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2917, pruned_loss=0.0658, over 4278754.12 frames. ], batch size: 263, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:34:51,615 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.95 vs. limit=12.0 2023-06-27 17:35:09,161 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1860156.0, ans=0.125 2023-06-27 17:35:14,233 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1860156.0, ans=0.0 2023-06-27 17:35:19,334 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1860216.0, ans=0.035 2023-06-27 17:35:20,881 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1860216.0, ans=0.09899494936611666 2023-06-27 17:35:33,554 INFO [train.py:996] (0/4) Epoch 11, batch 5100, loss[loss=0.1782, simple_loss=0.2586, pruned_loss=0.04889, over 21793.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2905, pruned_loss=0.06642, over 4289448.50 frames. ], batch size: 247, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:36:22,695 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1860396.0, ans=0.1 2023-06-27 17:36:41,432 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1860396.0, ans=0.125 2023-06-27 17:36:44,954 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1860456.0, ans=0.0 2023-06-27 17:36:47,576 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.357e+02 6.699e+02 8.715e+02 1.182e+03 3.007e+03, threshold=1.743e+03, percent-clipped=11.0 2023-06-27 17:36:56,629 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1860456.0, ans=0.125 2023-06-27 17:37:07,476 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.49 vs. limit=15.0 2023-06-27 17:37:23,093 INFO [train.py:996] (0/4) Epoch 11, batch 5150, loss[loss=0.2078, simple_loss=0.277, pruned_loss=0.06931, over 21345.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2872, pruned_loss=0.06651, over 4290622.44 frames. ], batch size: 144, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:38:58,787 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff2.min_abs, batch_count=1860816.0, ans=0.1 2023-06-27 17:39:12,462 INFO [train.py:996] (0/4) Epoch 11, batch 5200, loss[loss=0.2025, simple_loss=0.2772, pruned_loss=0.06391, over 21027.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2877, pruned_loss=0.06651, over 4287060.41 frames. 
], batch size: 607, lr: 2.69e-03, grad_scale: 32.0 2023-06-27 17:40:08,801 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.35 vs. limit=15.0 2023-06-27 17:40:17,761 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.467e+02 7.745e+02 1.179e+03 1.665e+03 4.294e+03, threshold=2.357e+03, percent-clipped=21.0 2023-06-27 17:41:00,886 INFO [train.py:996] (0/4) Epoch 11, batch 5250, loss[loss=0.2044, simple_loss=0.2975, pruned_loss=0.05567, over 21634.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2921, pruned_loss=0.06555, over 4290402.32 frames. ], batch size: 263, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 17:41:01,564 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1861176.0, ans=0.04949747468305833 2023-06-27 17:41:50,127 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 17:42:13,254 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1861356.0, ans=0.1 2023-06-27 17:42:18,147 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1861416.0, ans=0.2 2023-06-27 17:42:41,273 INFO [train.py:996] (0/4) Epoch 11, batch 5300, loss[loss=0.2194, simple_loss=0.2911, pruned_loss=0.07384, over 21336.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2901, pruned_loss=0.06594, over 4287826.37 frames. ], batch size: 144, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 17:42:48,448 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1861476.0, ans=0.125 2023-06-27 17:42:50,069 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1861476.0, ans=0.125 2023-06-27 17:43:39,367 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.786e+02 7.949e+02 1.214e+03 1.979e+03 3.974e+03, threshold=2.428e+03, percent-clipped=14.0 2023-06-27 17:43:50,058 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1861716.0, ans=0.125 2023-06-27 17:44:21,752 INFO [train.py:996] (0/4) Epoch 11, batch 5350, loss[loss=0.1974, simple_loss=0.2792, pruned_loss=0.05785, over 21952.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2901, pruned_loss=0.06734, over 4287250.87 frames. ], batch size: 113, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 17:44:40,762 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1861836.0, ans=0.1 2023-06-27 17:46:05,631 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.70 vs. limit=15.0 2023-06-27 17:46:05,871 INFO [train.py:996] (0/4) Epoch 11, batch 5400, loss[loss=0.1985, simple_loss=0.2654, pruned_loss=0.06575, over 21580.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2888, pruned_loss=0.06845, over 4290801.46 frames. 
], batch size: 548, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 17:46:18,560 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1862076.0, ans=0.125 2023-06-27 17:46:38,261 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1862136.0, ans=0.125 2023-06-27 17:46:52,288 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1862196.0, ans=0.125 2023-06-27 17:47:07,287 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.410e+02 6.450e+02 1.066e+03 1.376e+03 3.123e+03, threshold=2.132e+03, percent-clipped=3.0 2023-06-27 17:47:14,523 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1862256.0, ans=0.125 2023-06-27 17:47:25,222 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.37 vs. limit=15.0 2023-06-27 17:47:39,346 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1862316.0, ans=0.1 2023-06-27 17:47:45,907 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1862316.0, ans=0.125 2023-06-27 17:47:47,736 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1862316.0, ans=0.0 2023-06-27 17:47:50,501 INFO [train.py:996] (0/4) Epoch 11, batch 5450, loss[loss=0.2501, simple_loss=0.35, pruned_loss=0.07511, over 21666.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2897, pruned_loss=0.06676, over 4295507.39 frames. ], batch size: 414, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 17:47:56,916 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.82 vs. limit=15.0 2023-06-27 17:48:22,283 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.69 vs. limit=15.0 2023-06-27 17:48:27,061 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1862436.0, ans=0.125 2023-06-27 17:48:31,962 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1862496.0, ans=0.0 2023-06-27 17:48:39,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1862496.0, ans=0.125 2023-06-27 17:49:26,196 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1862616.0, ans=0.1 2023-06-27 17:49:40,245 INFO [train.py:996] (0/4) Epoch 11, batch 5500, loss[loss=0.2212, simple_loss=0.3246, pruned_loss=0.0589, over 21288.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.295, pruned_loss=0.06407, over 4290040.28 frames. 
], batch size: 548, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 17:50:48,688 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 17:50:49,748 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.051e+02 7.572e+02 1.528e+03 2.313e+03 5.179e+03, threshold=3.055e+03, percent-clipped=29.0 2023-06-27 17:51:18,764 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.51 vs. limit=15.0 2023-06-27 17:51:23,604 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1862976.0, ans=0.0 2023-06-27 17:51:24,530 INFO [train.py:996] (0/4) Epoch 11, batch 5550, loss[loss=0.182, simple_loss=0.2869, pruned_loss=0.03857, over 21629.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2954, pruned_loss=0.0616, over 4288689.82 frames. ], batch size: 414, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 17:51:34,419 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.10 vs. limit=22.5 2023-06-27 17:52:24,759 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1863156.0, ans=0.1 2023-06-27 17:52:31,569 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1863156.0, ans=0.125 2023-06-27 17:52:53,239 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1863216.0, ans=0.05 2023-06-27 17:52:56,648 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1863216.0, ans=0.1 2023-06-27 17:53:04,463 INFO [train.py:996] (0/4) Epoch 11, batch 5600, loss[loss=0.1795, simple_loss=0.29, pruned_loss=0.03449, over 21169.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.2947, pruned_loss=0.05931, over 4280580.18 frames. ], batch size: 548, lr: 2.69e-03, grad_scale: 32.0 2023-06-27 17:53:18,790 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1863276.0, ans=0.0 2023-06-27 17:53:41,014 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1863336.0, ans=0.0 2023-06-27 17:54:01,185 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1863396.0, ans=0.1 2023-06-27 17:54:13,841 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.087e+02 7.294e+02 1.095e+03 1.659e+03 3.151e+03, threshold=2.190e+03, percent-clipped=1.0 2023-06-27 17:54:41,739 INFO [train.py:996] (0/4) Epoch 11, batch 5650, loss[loss=0.2091, simple_loss=0.2908, pruned_loss=0.06372, over 21872.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.2964, pruned_loss=0.06048, over 4281523.57 frames. ], batch size: 351, lr: 2.69e-03, grad_scale: 32.0 2023-06-27 17:54:47,638 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1863576.0, ans=0.0 2023-06-27 17:56:17,546 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.72 vs. 
limit=10.0 2023-06-27 17:56:19,728 INFO [train.py:996] (0/4) Epoch 11, batch 5700, loss[loss=0.1989, simple_loss=0.2709, pruned_loss=0.06349, over 21550.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.2959, pruned_loss=0.0623, over 4281845.05 frames. ], batch size: 548, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 17:57:05,565 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.08 vs. limit=22.5 2023-06-27 17:57:32,521 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.653e+02 6.609e+02 9.381e+02 1.350e+03 3.463e+03, threshold=1.876e+03, percent-clipped=9.0 2023-06-27 17:57:43,464 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1864116.0, ans=0.5 2023-06-27 17:57:53,293 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1864116.0, ans=0.2 2023-06-27 17:58:13,664 INFO [train.py:996] (0/4) Epoch 11, batch 5750, loss[loss=0.1599, simple_loss=0.2502, pruned_loss=0.03483, over 21687.00 frames. ], tot_loss[loss=0.2051, simple_loss=0.2914, pruned_loss=0.05945, over 4280627.43 frames. ], batch size: 263, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 17:59:04,190 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.10 vs. limit=12.0 2023-06-27 17:59:13,538 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1864356.0, ans=0.0 2023-06-27 17:59:18,654 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1864356.0, ans=0.95 2023-06-27 17:59:20,124 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1864356.0, ans=0.0 2023-06-27 17:59:56,946 INFO [train.py:996] (0/4) Epoch 11, batch 5800, loss[loss=0.2084, simple_loss=0.3058, pruned_loss=0.05552, over 21731.00 frames. ], tot_loss[loss=0.2046, simple_loss=0.2925, pruned_loss=0.05833, over 4271659.24 frames. ], batch size: 298, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:00:07,543 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1864476.0, ans=0.125 2023-06-27 18:00:16,836 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.80 vs. limit=22.5 2023-06-27 18:00:36,660 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1864596.0, ans=10.0 2023-06-27 18:00:38,404 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1864596.0, ans=0.1 2023-06-27 18:01:04,531 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.579e+02 7.155e+02 1.088e+03 1.847e+03 4.141e+03, threshold=2.176e+03, percent-clipped=25.0 2023-06-27 18:01:41,157 INFO [train.py:996] (0/4) Epoch 11, batch 5850, loss[loss=0.1959, simple_loss=0.3061, pruned_loss=0.04283, over 21625.00 frames. ], tot_loss[loss=0.2011, simple_loss=0.2915, pruned_loss=0.05533, over 4278252.64 frames. 
], batch size: 441, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:02:18,665 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1864896.0, ans=0.0 2023-06-27 18:02:33,868 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.33 vs. limit=22.5 2023-06-27 18:03:07,522 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=1865016.0, ans=22.5 2023-06-27 18:03:17,826 INFO [train.py:996] (0/4) Epoch 11, batch 5900, loss[loss=0.228, simple_loss=0.298, pruned_loss=0.07901, over 20046.00 frames. ], tot_loss[loss=0.1934, simple_loss=0.2841, pruned_loss=0.05134, over 4282599.64 frames. ], batch size: 702, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:03:20,066 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1865076.0, ans=0.125 2023-06-27 18:03:33,137 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1865076.0, ans=0.2 2023-06-27 18:03:40,465 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.28 vs. limit=15.0 2023-06-27 18:03:47,261 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.06 vs. limit=10.0 2023-06-27 18:04:01,672 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1865196.0, ans=0.2 2023-06-27 18:04:28,081 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.367e+02 6.471e+02 9.679e+02 1.352e+03 2.438e+03, threshold=1.936e+03, percent-clipped=4.0 2023-06-27 18:04:35,958 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.01 vs. limit=6.0 2023-06-27 18:04:46,754 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=1865316.0, ans=0.95 2023-06-27 18:04:48,553 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1865316.0, ans=0.0 2023-06-27 18:04:54,754 INFO [train.py:996] (0/4) Epoch 11, batch 5950, loss[loss=0.1984, simple_loss=0.2704, pruned_loss=0.0632, over 21358.00 frames. ], tot_loss[loss=0.1961, simple_loss=0.2828, pruned_loss=0.0547, over 4282236.98 frames. ], batch size: 131, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:05:11,695 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1865376.0, ans=0.0 2023-06-27 18:06:37,189 INFO [train.py:996] (0/4) Epoch 11, batch 6000, loss[loss=0.2051, simple_loss=0.2666, pruned_loss=0.07183, over 21517.00 frames. ], tot_loss[loss=0.1962, simple_loss=0.2782, pruned_loss=0.05703, over 4271925.90 frames. ], batch size: 391, lr: 2.69e-03, grad_scale: 32.0 2023-06-27 18:06:37,190 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-27 18:06:56,344 INFO [train.py:1028] (0/4) Epoch 11, validation: loss=0.2612, simple_loss=0.354, pruned_loss=0.08419, over 1796401.00 frames. 
2023-06-27 18:06:56,345 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23714MB 2023-06-27 18:07:17,588 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1865736.0, ans=0.0 2023-06-27 18:07:53,683 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1865796.0, ans=0.09899494936611666 2023-06-27 18:08:10,062 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.353e+02 5.907e+02 8.109e+02 1.325e+03 2.971e+03, threshold=1.622e+03, percent-clipped=7.0 2023-06-27 18:08:39,957 INFO [train.py:996] (0/4) Epoch 11, batch 6050, loss[loss=0.1906, simple_loss=0.2554, pruned_loss=0.06292, over 21374.00 frames. ], tot_loss[loss=0.1956, simple_loss=0.2738, pruned_loss=0.05871, over 4263286.40 frames. ], batch size: 160, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:10:03,488 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1866216.0, ans=0.125 2023-06-27 18:10:17,459 INFO [train.py:996] (0/4) Epoch 11, batch 6100, loss[loss=0.1907, simple_loss=0.2757, pruned_loss=0.05283, over 21591.00 frames. ], tot_loss[loss=0.195, simple_loss=0.2739, pruned_loss=0.05801, over 4268419.62 frames. ], batch size: 263, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:11:29,685 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.047e+02 7.065e+02 1.029e+03 1.365e+03 3.489e+03, threshold=2.059e+03, percent-clipped=16.0 2023-06-27 18:11:59,723 INFO [train.py:996] (0/4) Epoch 11, batch 6150, loss[loss=0.2123, simple_loss=0.2816, pruned_loss=0.07153, over 21844.00 frames. ], tot_loss[loss=0.1989, simple_loss=0.2775, pruned_loss=0.06015, over 4265426.18 frames. ], batch size: 98, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:12:07,180 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1866576.0, ans=0.125 2023-06-27 18:13:38,541 INFO [train.py:996] (0/4) Epoch 11, batch 6200, loss[loss=0.224, simple_loss=0.3115, pruned_loss=0.06826, over 21798.00 frames. ], tot_loss[loss=0.2009, simple_loss=0.2801, pruned_loss=0.06089, over 4273403.77 frames. ], batch size: 298, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:14:32,280 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.31 vs. limit=15.0 2023-06-27 18:14:42,367 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=18.08 vs. limit=22.5 2023-06-27 18:14:50,288 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.73 vs. limit=15.0 2023-06-27 18:14:52,449 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.390e+02 7.354e+02 1.075e+03 1.607e+03 4.153e+03, threshold=2.150e+03, percent-clipped=10.0 2023-06-27 18:15:18,551 INFO [train.py:996] (0/4) Epoch 11, batch 6250, loss[loss=0.225, simple_loss=0.3302, pruned_loss=0.0599, over 21716.00 frames. ], tot_loss[loss=0.2037, simple_loss=0.2861, pruned_loss=0.06065, over 4282340.43 frames. 
], batch size: 351, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:15:56,887 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1867236.0, ans=0.0 2023-06-27 18:15:58,652 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1867236.0, ans=0.05 2023-06-27 18:16:07,518 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.47 vs. limit=22.5 2023-06-27 18:16:07,554 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.78 vs. limit=22.5 2023-06-27 18:16:33,509 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1867356.0, ans=0.125 2023-06-27 18:17:03,013 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1867416.0, ans=0.125 2023-06-27 18:17:10,368 INFO [train.py:996] (0/4) Epoch 11, batch 6300, loss[loss=0.2648, simple_loss=0.3155, pruned_loss=0.1071, over 21766.00 frames. ], tot_loss[loss=0.2041, simple_loss=0.288, pruned_loss=0.06009, over 4281899.93 frames. ], batch size: 507, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:17:17,904 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.74 vs. limit=15.0 2023-06-27 18:18:17,775 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.254e+02 6.166e+02 8.295e+02 1.136e+03 2.739e+03, threshold=1.659e+03, percent-clipped=3.0 2023-06-27 18:18:49,490 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1867716.0, ans=0.1 2023-06-27 18:18:52,466 INFO [train.py:996] (0/4) Epoch 11, batch 6350, loss[loss=0.2486, simple_loss=0.3213, pruned_loss=0.08793, over 21589.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2906, pruned_loss=0.06368, over 4286024.63 frames. ], batch size: 389, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:19:09,743 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1867776.0, ans=0.125 2023-06-27 18:20:24,676 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1868016.0, ans=0.2 2023-06-27 18:20:40,584 INFO [train.py:996] (0/4) Epoch 11, batch 6400, loss[loss=0.2306, simple_loss=0.3075, pruned_loss=0.07688, over 21406.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2956, pruned_loss=0.06749, over 4289399.99 frames. ], batch size: 549, lr: 2.69e-03, grad_scale: 32.0 2023-06-27 18:21:43,405 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1868256.0, ans=0.125 2023-06-27 18:21:55,775 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.666e+02 7.590e+02 1.060e+03 1.570e+03 3.138e+03, threshold=2.120e+03, percent-clipped=19.0 2023-06-27 18:22:23,561 INFO [train.py:996] (0/4) Epoch 11, batch 6450, loss[loss=0.1908, simple_loss=0.2795, pruned_loss=0.05106, over 21686.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2983, pruned_loss=0.06665, over 4285368.52 frames. 
], batch size: 282, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:22:48,352 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1868436.0, ans=0.0 2023-06-27 18:22:51,775 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1868436.0, ans=0.125 2023-06-27 18:23:38,781 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1868556.0, ans=0.1 2023-06-27 18:24:06,999 INFO [train.py:996] (0/4) Epoch 11, batch 6500, loss[loss=0.2106, simple_loss=0.2742, pruned_loss=0.07352, over 21831.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2932, pruned_loss=0.06586, over 4287216.16 frames. ], batch size: 372, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:24:10,904 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1868676.0, ans=0.125 2023-06-27 18:24:12,465 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1868676.0, ans=0.0 2023-06-27 18:24:12,478 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 18:24:33,477 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1868736.0, ans=0.125 2023-06-27 18:24:46,928 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1868796.0, ans=0.125 2023-06-27 18:24:50,300 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1868796.0, ans=0.0 2023-06-27 18:25:13,490 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1868856.0, ans=0.1 2023-06-27 18:25:20,919 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.722e+02 7.121e+02 1.016e+03 1.758e+03 3.430e+03, threshold=2.032e+03, percent-clipped=12.0 2023-06-27 18:25:48,615 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.63 vs. limit=15.0 2023-06-27 18:25:48,832 INFO [train.py:996] (0/4) Epoch 11, batch 6550, loss[loss=0.2296, simple_loss=0.3066, pruned_loss=0.07629, over 21746.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2924, pruned_loss=0.06432, over 4282110.46 frames. ], batch size: 414, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:25:57,980 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1868976.0, ans=0.2 2023-06-27 18:26:25,248 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1869096.0, ans=0.0 2023-06-27 18:26:57,895 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1869156.0, ans=0.1 2023-06-27 18:27:08,533 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1869156.0, ans=0.125 2023-06-27 18:27:29,175 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.93 vs. 
limit=10.0 2023-06-27 18:27:31,154 INFO [train.py:996] (0/4) Epoch 11, batch 6600, loss[loss=0.1862, simple_loss=0.2506, pruned_loss=0.06093, over 21773.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.2863, pruned_loss=0.06368, over 4278509.77 frames. ], batch size: 317, lr: 2.69e-03, grad_scale: 8.0 2023-06-27 18:28:36,368 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1869456.0, ans=0.125 2023-06-27 18:28:50,866 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.168e+02 6.592e+02 1.007e+03 1.403e+03 3.039e+03, threshold=2.014e+03, percent-clipped=10.0 2023-06-27 18:29:03,431 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1869516.0, ans=0.2 2023-06-27 18:29:05,203 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1869516.0, ans=0.07 2023-06-27 18:29:09,113 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.60 vs. limit=10.0 2023-06-27 18:29:12,965 INFO [train.py:996] (0/4) Epoch 11, batch 6650, loss[loss=0.1682, simple_loss=0.2445, pruned_loss=0.0459, over 21302.00 frames. ], tot_loss[loss=0.2007, simple_loss=0.2788, pruned_loss=0.0613, over 4270609.73 frames. ], batch size: 160, lr: 2.69e-03, grad_scale: 8.0 2023-06-27 18:29:50,189 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1869636.0, ans=0.0 2023-06-27 18:30:59,832 INFO [train.py:996] (0/4) Epoch 11, batch 6700, loss[loss=0.1821, simple_loss=0.2591, pruned_loss=0.05256, over 21602.00 frames. ], tot_loss[loss=0.1969, simple_loss=0.273, pruned_loss=0.06038, over 4273710.01 frames. ], batch size: 247, lr: 2.69e-03, grad_scale: 8.0 2023-06-27 18:32:16,559 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.214e+02 6.879e+02 9.707e+02 1.410e+03 2.811e+03, threshold=1.941e+03, percent-clipped=3.0 2023-06-27 18:32:21,858 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1870116.0, ans=0.125 2023-06-27 18:32:42,387 INFO [train.py:996] (0/4) Epoch 11, batch 6750, loss[loss=0.2259, simple_loss=0.2923, pruned_loss=0.07982, over 21838.00 frames. ], tot_loss[loss=0.1967, simple_loss=0.2718, pruned_loss=0.0608, over 4275492.83 frames. ], batch size: 371, lr: 2.69e-03, grad_scale: 8.0 2023-06-27 18:33:32,466 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1870296.0, ans=0.125 2023-06-27 18:34:23,480 INFO [train.py:996] (0/4) Epoch 11, batch 6800, loss[loss=0.1967, simple_loss=0.2645, pruned_loss=0.06448, over 21646.00 frames. ], tot_loss[loss=0.1997, simple_loss=0.2744, pruned_loss=0.06251, over 4275498.71 frames. 
], batch size: 298, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:34:32,595 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1870476.0, ans=0.125 2023-06-27 18:34:35,882 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1870476.0, ans=0.125 2023-06-27 18:35:05,304 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1870596.0, ans=0.125 2023-06-27 18:35:26,863 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.83 vs. limit=15.0 2023-06-27 18:35:39,118 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.574e+02 7.159e+02 9.186e+02 1.470e+03 3.415e+03, threshold=1.837e+03, percent-clipped=10.0 2023-06-27 18:35:53,518 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.49 vs. limit=5.0 2023-06-27 18:36:00,279 INFO [train.py:996] (0/4) Epoch 11, batch 6850, loss[loss=0.2248, simple_loss=0.2827, pruned_loss=0.08343, over 21579.00 frames. ], tot_loss[loss=0.2023, simple_loss=0.2754, pruned_loss=0.06463, over 4276488.41 frames. ], batch size: 473, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:36:15,451 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1870776.0, ans=0.07 2023-06-27 18:36:18,748 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1870776.0, ans=0.07 2023-06-27 18:36:48,552 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1870896.0, ans=0.1 2023-06-27 18:37:28,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1871016.0, ans=0.125 2023-06-27 18:37:43,676 INFO [train.py:996] (0/4) Epoch 11, batch 6900, loss[loss=0.1936, simple_loss=0.2972, pruned_loss=0.04497, over 21689.00 frames. ], tot_loss[loss=0.2023, simple_loss=0.2756, pruned_loss=0.06455, over 4283923.66 frames. ], batch size: 414, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:37:52,468 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1871076.0, ans=10.0 2023-06-27 18:38:18,099 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1871136.0, ans=0.2 2023-06-27 18:38:39,047 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1871196.0, ans=0.125 2023-06-27 18:39:05,836 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.235e+02 7.048e+02 1.193e+03 1.711e+03 4.903e+03, threshold=2.385e+03, percent-clipped=22.0 2023-06-27 18:39:06,362 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1871256.0, ans=0.125 2023-06-27 18:39:25,764 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1871316.0, ans=0.0 2023-06-27 18:39:31,787 INFO [train.py:996] (0/4) Epoch 11, batch 6950, loss[loss=0.2013, simple_loss=0.3053, pruned_loss=0.0487, over 21693.00 frames. 
], tot_loss[loss=0.2016, simple_loss=0.2779, pruned_loss=0.06259, over 4283844.20 frames. ], batch size: 441, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:40:47,437 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1871556.0, ans=0.0 2023-06-27 18:41:14,899 INFO [train.py:996] (0/4) Epoch 11, batch 7000, loss[loss=0.2067, simple_loss=0.2703, pruned_loss=0.07159, over 21773.00 frames. ], tot_loss[loss=0.2034, simple_loss=0.2799, pruned_loss=0.06348, over 4263211.51 frames. ], batch size: 371, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:41:17,901 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.66 vs. limit=15.0 2023-06-27 18:41:55,863 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1871736.0, ans=0.125 2023-06-27 18:42:19,368 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1871856.0, ans=0.04949747468305833 2023-06-27 18:42:24,277 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1871856.0, ans=0.125 2023-06-27 18:42:31,979 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.526e+02 6.963e+02 9.301e+02 1.305e+03 2.856e+03, threshold=1.860e+03, percent-clipped=1.0 2023-06-27 18:42:58,608 INFO [train.py:996] (0/4) Epoch 11, batch 7050, loss[loss=0.1764, simple_loss=0.2621, pruned_loss=0.04539, over 21609.00 frames. ], tot_loss[loss=0.2017, simple_loss=0.278, pruned_loss=0.06272, over 4264415.17 frames. ], batch size: 263, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:43:08,793 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-312000.pt 2023-06-27 18:43:13,147 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.43 vs. limit=6.0 2023-06-27 18:43:52,168 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1872096.0, ans=10.0 2023-06-27 18:44:07,693 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1872156.0, ans=0.09899494936611666 2023-06-27 18:44:47,741 INFO [train.py:996] (0/4) Epoch 11, batch 7100, loss[loss=0.2174, simple_loss=0.3013, pruned_loss=0.06678, over 21207.00 frames. ], tot_loss[loss=0.206, simple_loss=0.283, pruned_loss=0.06449, over 4267564.61 frames. ], batch size: 143, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:44:49,823 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1872276.0, ans=0.125 2023-06-27 18:45:00,873 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1872276.0, ans=0.125 2023-06-27 18:45:17,529 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1872336.0, ans=0.125 2023-06-27 18:45:24,900 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=10.12 vs. 
limit=10.0 2023-06-27 18:46:03,441 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.854e+02 6.087e+02 7.876e+02 1.187e+03 3.248e+03, threshold=1.575e+03, percent-clipped=9.0 2023-06-27 18:46:10,874 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1872516.0, ans=0.1 2023-06-27 18:46:30,035 INFO [train.py:996] (0/4) Epoch 11, batch 7150, loss[loss=0.2663, simple_loss=0.3353, pruned_loss=0.09868, over 21417.00 frames. ], tot_loss[loss=0.2022, simple_loss=0.28, pruned_loss=0.06222, over 4271004.26 frames. ], batch size: 471, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:47:22,945 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1872696.0, ans=0.5 2023-06-27 18:47:24,571 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1872696.0, ans=0.0 2023-06-27 18:48:00,913 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 18:48:04,263 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1872816.0, ans=0.0 2023-06-27 18:48:16,442 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.14 vs. limit=15.0 2023-06-27 18:48:18,367 INFO [train.py:996] (0/4) Epoch 11, batch 7200, loss[loss=0.205, simple_loss=0.2698, pruned_loss=0.07005, over 21837.00 frames. ], tot_loss[loss=0.2058, simple_loss=0.283, pruned_loss=0.06432, over 4268293.57 frames. ], batch size: 373, lr: 2.69e-03, grad_scale: 32.0 2023-06-27 18:48:52,370 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1872936.0, ans=0.125 2023-06-27 18:49:11,736 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1872996.0, ans=0.125 2023-06-27 18:49:35,426 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.391e+02 8.685e+02 1.394e+03 1.830e+03 3.525e+03, threshold=2.788e+03, percent-clipped=36.0 2023-06-27 18:50:04,654 INFO [train.py:996] (0/4) Epoch 11, batch 7250, loss[loss=0.1968, simple_loss=0.2597, pruned_loss=0.06689, over 21156.00 frames. ], tot_loss[loss=0.2042, simple_loss=0.2798, pruned_loss=0.06432, over 4270167.49 frames. ], batch size: 143, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:50:55,574 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.96 vs. limit=15.0 2023-06-27 18:50:58,350 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1873356.0, ans=0.125 2023-06-27 18:51:23,692 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.12 vs. limit=15.0 2023-06-27 18:51:43,883 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.42 vs. limit=15.0 2023-06-27 18:51:47,390 INFO [train.py:996] (0/4) Epoch 11, batch 7300, loss[loss=0.1905, simple_loss=0.2549, pruned_loss=0.06305, over 21655.00 frames. ], tot_loss[loss=0.2015, simple_loss=0.2748, pruned_loss=0.06406, over 4269132.41 frames. 
], batch size: 333, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:51:58,011 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1873476.0, ans=0.0 2023-06-27 18:52:49,336 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1873656.0, ans=0.2 2023-06-27 18:52:59,771 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.81 vs. limit=12.0 2023-06-27 18:53:00,322 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.280e+02 7.298e+02 1.227e+03 1.780e+03 3.301e+03, threshold=2.454e+03, percent-clipped=5.0 2023-06-27 18:53:04,155 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1873716.0, ans=0.125 2023-06-27 18:53:26,000 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1873716.0, ans=0.125 2023-06-27 18:53:30,251 INFO [train.py:996] (0/4) Epoch 11, batch 7350, loss[loss=0.2525, simple_loss=0.3116, pruned_loss=0.09667, over 21731.00 frames. ], tot_loss[loss=0.2017, simple_loss=0.2733, pruned_loss=0.06499, over 4267133.00 frames. ], batch size: 441, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:54:19,470 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1873896.0, ans=0.125 2023-06-27 18:54:27,947 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1873896.0, ans=0.1 2023-06-27 18:54:41,585 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1873956.0, ans=10.0 2023-06-27 18:55:11,222 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1874016.0, ans=0.1 2023-06-27 18:55:13,337 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.26 vs. limit=15.0 2023-06-27 18:55:13,757 INFO [train.py:996] (0/4) Epoch 11, batch 7400, loss[loss=0.1816, simple_loss=0.2564, pruned_loss=0.05338, over 21297.00 frames. ], tot_loss[loss=0.2056, simple_loss=0.2783, pruned_loss=0.06648, over 4273253.11 frames. 
], batch size: 176, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:55:22,920 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1874076.0, ans=0.0 2023-06-27 18:55:37,766 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1874136.0, ans=0.0 2023-06-27 18:56:05,322 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1874196.0, ans=0.125 2023-06-27 18:56:14,051 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1874256.0, ans=0.1 2023-06-27 18:56:17,237 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1874256.0, ans=0.125 2023-06-27 18:56:31,616 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.427e+02 7.089e+02 1.051e+03 1.718e+03 3.603e+03, threshold=2.102e+03, percent-clipped=3.0 2023-06-27 18:56:42,520 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1874316.0, ans=0.125 2023-06-27 18:56:57,308 INFO [train.py:996] (0/4) Epoch 11, batch 7450, loss[loss=0.2016, simple_loss=0.2649, pruned_loss=0.06913, over 21588.00 frames. ], tot_loss[loss=0.2042, simple_loss=0.277, pruned_loss=0.06566, over 4268778.98 frames. ], batch size: 231, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:57:13,451 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1874436.0, ans=0.2 2023-06-27 18:57:31,477 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1874436.0, ans=0.0 2023-06-27 18:58:41,420 INFO [train.py:996] (0/4) Epoch 11, batch 7500, loss[loss=0.2392, simple_loss=0.3379, pruned_loss=0.07022, over 21678.00 frames. ], tot_loss[loss=0.2085, simple_loss=0.2831, pruned_loss=0.06698, over 4267852.02 frames. ], batch size: 247, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 18:58:55,528 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1874676.0, ans=0.125 2023-06-27 18:59:04,919 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1874736.0, ans=0.125 2023-06-27 18:59:06,730 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1874736.0, ans=0.125 2023-06-27 18:59:42,502 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=16.17 vs. 
limit=15.0 2023-06-27 19:00:00,251 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=1874856.0, ans=0.025 2023-06-27 19:00:04,710 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.389e+02 7.977e+02 1.325e+03 1.991e+03 3.400e+03, threshold=2.650e+03, percent-clipped=21.0 2023-06-27 19:00:06,888 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1874916.0, ans=0.125 2023-06-27 19:00:13,598 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1874916.0, ans=0.0 2023-06-27 19:00:24,550 INFO [train.py:996] (0/4) Epoch 11, batch 7550, loss[loss=0.2024, simple_loss=0.2644, pruned_loss=0.0702, over 21226.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2898, pruned_loss=0.06656, over 4273352.60 frames. ], batch size: 608, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:00:41,238 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1875036.0, ans=0.1 2023-06-27 19:01:08,425 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.27 vs. limit=10.0 2023-06-27 19:01:09,560 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1875096.0, ans=0.2 2023-06-27 19:01:12,535 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1875096.0, ans=0.125 2023-06-27 19:01:17,596 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1875096.0, ans=0.1 2023-06-27 19:01:17,654 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1875096.0, ans=0.0 2023-06-27 19:01:30,436 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1875156.0, ans=0.0 2023-06-27 19:01:56,859 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1875216.0, ans=0.125 2023-06-27 19:02:01,988 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.37 vs. limit=6.0 2023-06-27 19:02:05,494 INFO [train.py:996] (0/4) Epoch 11, batch 7600, loss[loss=0.1993, simple_loss=0.3112, pruned_loss=0.04369, over 20818.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2885, pruned_loss=0.06479, over 4276597.61 frames. 
], batch size: 608, lr: 2.68e-03, grad_scale: 32.0 2023-06-27 19:02:29,009 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1875336.0, ans=0.125 2023-06-27 19:03:08,092 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1875456.0, ans=0.0 2023-06-27 19:03:16,205 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1875456.0, ans=0.1 2023-06-27 19:03:27,964 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1875456.0, ans=0.125 2023-06-27 19:03:28,848 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.792e+02 7.250e+02 9.858e+02 1.337e+03 3.374e+03, threshold=1.972e+03, percent-clipped=5.0 2023-06-27 19:03:39,511 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1875516.0, ans=0.0 2023-06-27 19:03:47,207 INFO [train.py:996] (0/4) Epoch 11, batch 7650, loss[loss=0.194, simple_loss=0.2629, pruned_loss=0.06253, over 21906.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2885, pruned_loss=0.06634, over 4279929.26 frames. ], batch size: 283, lr: 2.68e-03, grad_scale: 8.0 2023-06-27 19:04:08,086 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.00 vs. limit=15.0 2023-06-27 19:04:09,797 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.59 vs. limit=15.0 2023-06-27 19:04:48,241 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1875696.0, ans=0.125 2023-06-27 19:05:03,196 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1875756.0, ans=0.0 2023-06-27 19:05:11,445 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1875756.0, ans=0.0 2023-06-27 19:05:25,225 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.64 vs. limit=15.0 2023-06-27 19:05:29,944 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1875876.0, ans=0.125 2023-06-27 19:05:30,771 INFO [train.py:996] (0/4) Epoch 11, batch 7700, loss[loss=0.2571, simple_loss=0.3325, pruned_loss=0.09086, over 21330.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2929, pruned_loss=0.0698, over 4283769.92 frames. ], batch size: 176, lr: 2.68e-03, grad_scale: 8.0 2023-06-27 19:06:00,512 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1875936.0, ans=0.0 2023-06-27 19:06:57,054 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1876056.0, ans=0.125 2023-06-27 19:06:59,794 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.712e+02 8.160e+02 1.175e+03 1.754e+03 4.757e+03, threshold=2.350e+03, percent-clipped=23.0 2023-06-27 19:07:16,831 INFO [train.py:996] (0/4) Epoch 11, batch 7750, loss[loss=0.2575, simple_loss=0.3641, pruned_loss=0.07549, over 21867.00 frames. 
], tot_loss[loss=0.2193, simple_loss=0.3, pruned_loss=0.06927, over 4274892.06 frames. ], batch size: 372, lr: 2.68e-03, grad_scale: 8.0 2023-06-27 19:07:23,608 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1876176.0, ans=0.0 2023-06-27 19:07:32,435 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1876176.0, ans=0.0 2023-06-27 19:09:10,467 INFO [train.py:996] (0/4) Epoch 11, batch 7800, loss[loss=0.1957, simple_loss=0.2636, pruned_loss=0.06388, over 21514.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2992, pruned_loss=0.06906, over 4267711.06 frames. ], batch size: 195, lr: 2.68e-03, grad_scale: 8.0 2023-06-27 19:09:11,190 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 19:09:39,560 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1876536.0, ans=0.125 2023-06-27 19:09:57,801 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=1876596.0, ans=15.0 2023-06-27 19:10:24,194 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1876656.0, ans=0.125 2023-06-27 19:10:26,680 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.510e+02 6.767e+02 1.181e+03 1.586e+03 4.451e+03, threshold=2.363e+03, percent-clipped=7.0 2023-06-27 19:10:53,758 INFO [train.py:996] (0/4) Epoch 11, batch 7850, loss[loss=0.1972, simple_loss=0.2612, pruned_loss=0.06662, over 21208.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2902, pruned_loss=0.06819, over 4260097.45 frames. ], batch size: 144, lr: 2.68e-03, grad_scale: 8.0 2023-06-27 19:10:57,670 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1876776.0, ans=0.125 2023-06-27 19:11:00,129 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.26 vs. limit=15.0 2023-06-27 19:11:18,562 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.20 vs. limit=22.5 2023-06-27 19:11:21,435 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1876836.0, ans=0.125 2023-06-27 19:11:30,256 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.77 vs. limit=15.0 2023-06-27 19:11:36,781 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1876896.0, ans=0.1 2023-06-27 19:11:56,189 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1876956.0, ans=0.025 2023-06-27 19:11:59,580 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1876956.0, ans=0.125 2023-06-27 19:12:40,441 INFO [train.py:996] (0/4) Epoch 11, batch 7900, loss[loss=0.1836, simple_loss=0.2477, pruned_loss=0.05977, over 21338.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2875, pruned_loss=0.06724, over 4256302.63 frames. 
], batch size: 177, lr: 2.68e-03, grad_scale: 8.0 2023-06-27 19:13:07,167 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1877136.0, ans=0.04949747468305833 2023-06-27 19:13:31,318 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1877196.0, ans=0.125 2023-06-27 19:14:08,124 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.410e+02 7.562e+02 1.142e+03 1.795e+03 4.843e+03, threshold=2.283e+03, percent-clipped=15.0 2023-06-27 19:14:29,967 INFO [train.py:996] (0/4) Epoch 11, batch 7950, loss[loss=0.2141, simple_loss=0.2977, pruned_loss=0.06527, over 20789.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2924, pruned_loss=0.06658, over 4266425.76 frames. ], batch size: 609, lr: 2.68e-03, grad_scale: 8.0 2023-06-27 19:14:56,428 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1877436.0, ans=0.0 2023-06-27 19:15:29,090 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1877496.0, ans=0.1 2023-06-27 19:16:22,061 INFO [train.py:996] (0/4) Epoch 11, batch 8000, loss[loss=0.304, simple_loss=0.3706, pruned_loss=0.1187, over 21366.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2972, pruned_loss=0.06912, over 4273500.13 frames. ], batch size: 507, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:16:39,262 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.82 vs. limit=15.0 2023-06-27 19:17:04,832 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.38 vs. limit=12.0 2023-06-27 19:17:51,520 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.031e+02 6.364e+02 9.395e+02 1.417e+03 3.378e+03, threshold=1.879e+03, percent-clipped=5.0 2023-06-27 19:17:52,737 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.06 vs. limit=15.0 2023-06-27 19:18:08,688 INFO [train.py:996] (0/4) Epoch 11, batch 8050, loss[loss=0.2495, simple_loss=0.3416, pruned_loss=0.07869, over 21664.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2992, pruned_loss=0.06897, over 4269178.73 frames. ], batch size: 389, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:19:45,311 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.10 vs. limit=15.0 2023-06-27 19:19:53,006 INFO [train.py:996] (0/4) Epoch 11, batch 8100, loss[loss=0.2132, simple_loss=0.2846, pruned_loss=0.07095, over 21553.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2981, pruned_loss=0.06908, over 4271227.75 frames. 
], batch size: 212, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:20:39,454 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 19:21:07,657 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1878456.0, ans=0.1 2023-06-27 19:21:22,432 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.025e+02 8.290e+02 1.329e+03 2.139e+03 5.514e+03, threshold=2.658e+03, percent-clipped=35.0 2023-06-27 19:21:23,640 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.24 vs. limit=15.0 2023-06-27 19:21:48,875 INFO [train.py:996] (0/4) Epoch 11, batch 8150, loss[loss=0.2217, simple_loss=0.3241, pruned_loss=0.05966, over 21678.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.3044, pruned_loss=0.07068, over 4275458.47 frames. ], batch size: 298, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:22:15,413 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1878636.0, ans=0.125 2023-06-27 19:22:28,590 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1878636.0, ans=0.125 2023-06-27 19:22:46,969 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1878756.0, ans=0.1 2023-06-27 19:23:31,181 INFO [train.py:996] (0/4) Epoch 11, batch 8200, loss[loss=0.1942, simple_loss=0.2617, pruned_loss=0.0634, over 21878.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2966, pruned_loss=0.06828, over 4273042.59 frames. ], batch size: 373, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:23:46,718 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1878876.0, ans=0.125 2023-06-27 19:24:53,468 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.441e+02 7.151e+02 1.119e+03 1.525e+03 4.860e+03, threshold=2.239e+03, percent-clipped=3.0 2023-06-27 19:25:07,458 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1879116.0, ans=0.1 2023-06-27 19:25:07,540 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1879116.0, ans=0.0 2023-06-27 19:25:15,173 INFO [train.py:996] (0/4) Epoch 11, batch 8250, loss[loss=0.1996, simple_loss=0.2784, pruned_loss=0.06044, over 21270.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2929, pruned_loss=0.06766, over 4266341.88 frames. ], batch size: 176, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:26:01,289 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1879296.0, ans=0.2 2023-06-27 19:26:43,175 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1879416.0, ans=0.0 2023-06-27 19:26:59,255 INFO [train.py:996] (0/4) Epoch 11, batch 8300, loss[loss=0.2116, simple_loss=0.3002, pruned_loss=0.0615, over 21683.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2926, pruned_loss=0.06555, over 4271144.75 frames. 
], batch size: 351, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:26:59,791 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1879476.0, ans=0.125 2023-06-27 19:26:59,985 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1879476.0, ans=0.07 2023-06-27 19:27:51,058 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1879596.0, ans=0.0 2023-06-27 19:27:52,890 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1879596.0, ans=0.0 2023-06-27 19:28:25,637 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.537e+02 6.833e+02 1.058e+03 1.562e+03 3.226e+03, threshold=2.116e+03, percent-clipped=10.0 2023-06-27 19:28:41,976 INFO [train.py:996] (0/4) Epoch 11, batch 8350, loss[loss=0.2043, simple_loss=0.2875, pruned_loss=0.06058, over 21657.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2927, pruned_loss=0.06479, over 4266894.61 frames. ], batch size: 332, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:28:52,137 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1879776.0, ans=0.125 2023-06-27 19:28:57,804 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.42 vs. limit=15.0 2023-06-27 19:28:59,523 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.54 vs. limit=6.0 2023-06-27 19:29:33,178 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.27 vs. limit=15.0 2023-06-27 19:29:34,345 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1879896.0, ans=0.125 2023-06-27 19:29:51,186 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1879956.0, ans=0.125 2023-06-27 19:30:05,847 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1879956.0, ans=0.2 2023-06-27 19:30:29,634 INFO [train.py:996] (0/4) Epoch 11, batch 8400, loss[loss=0.1627, simple_loss=0.2578, pruned_loss=0.03379, over 21736.00 frames. ], tot_loss[loss=0.2052, simple_loss=0.2874, pruned_loss=0.06147, over 4260572.46 frames. ], batch size: 282, lr: 2.68e-03, grad_scale: 32.0 2023-06-27 19:30:39,008 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1880076.0, ans=0.125 2023-06-27 19:30:54,197 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.01 vs. limit=6.0 2023-06-27 19:31:11,661 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1880196.0, ans=0.1 2023-06-27 19:31:51,127 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.798e+02 6.790e+02 1.029e+03 1.707e+03 4.211e+03, threshold=2.059e+03, percent-clipped=16.0 2023-06-27 19:32:11,266 INFO [train.py:996] (0/4) Epoch 11, batch 8450, loss[loss=0.2234, simple_loss=0.2929, pruned_loss=0.07699, over 21904.00 frames. 
], tot_loss[loss=0.2058, simple_loss=0.2885, pruned_loss=0.06153, over 4263347.43 frames. ], batch size: 124, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:32:14,941 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1880376.0, ans=0.1 2023-06-27 19:32:41,123 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1880436.0, ans=0.1 2023-06-27 19:32:47,734 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1880496.0, ans=0.125 2023-06-27 19:33:33,919 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1880616.0, ans=0.0 2023-06-27 19:33:48,565 INFO [train.py:996] (0/4) Epoch 11, batch 8500, loss[loss=0.2013, simple_loss=0.2633, pruned_loss=0.0696, over 21725.00 frames. ], tot_loss[loss=0.2051, simple_loss=0.2846, pruned_loss=0.06279, over 4265236.59 frames. ], batch size: 316, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:34:38,579 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1880796.0, ans=0.0 2023-06-27 19:35:17,400 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.145e+02 8.155e+02 1.098e+03 1.780e+03 3.950e+03, threshold=2.195e+03, percent-clipped=18.0 2023-06-27 19:35:37,553 INFO [train.py:996] (0/4) Epoch 11, batch 8550, loss[loss=0.2027, simple_loss=0.2813, pruned_loss=0.06201, over 21262.00 frames. ], tot_loss[loss=0.208, simple_loss=0.2865, pruned_loss=0.06472, over 4270613.55 frames. ], batch size: 144, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:35:38,298 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1880976.0, ans=0.0 2023-06-27 19:35:40,656 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.66 vs. limit=10.0 2023-06-27 19:35:47,224 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1880976.0, ans=0.125 2023-06-27 19:37:18,778 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.74 vs. limit=22.5 2023-06-27 19:37:27,715 INFO [train.py:996] (0/4) Epoch 11, batch 8600, loss[loss=0.2105, simple_loss=0.2941, pruned_loss=0.06348, over 21726.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2933, pruned_loss=0.06677, over 4265211.43 frames. 
], batch size: 298, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:37:31,797 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1881276.0, ans=0.1 2023-06-27 19:37:52,375 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 19:38:02,155 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1881336.0, ans=0.0 2023-06-27 19:38:05,551 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1881396.0, ans=0.025 2023-06-27 19:38:07,721 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.04 vs. limit=15.0 2023-06-27 19:38:50,814 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.911e+02 7.000e+02 1.009e+03 1.607e+03 3.888e+03, threshold=2.018e+03, percent-clipped=13.0 2023-06-27 19:39:11,183 INFO [train.py:996] (0/4) Epoch 11, batch 8650, loss[loss=0.1638, simple_loss=0.2569, pruned_loss=0.03532, over 21760.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2993, pruned_loss=0.06691, over 4263798.79 frames. ], batch size: 298, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:39:40,063 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1881636.0, ans=0.2 2023-06-27 19:40:07,649 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1881696.0, ans=0.0 2023-06-27 19:40:12,674 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1881756.0, ans=0.0 2023-06-27 19:40:39,850 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=1881816.0, ans=10.0 2023-06-27 19:40:44,862 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1881816.0, ans=0.125 2023-06-27 19:40:52,507 INFO [train.py:996] (0/4) Epoch 11, batch 8700, loss[loss=0.1954, simple_loss=0.2592, pruned_loss=0.06576, over 21767.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2925, pruned_loss=0.06425, over 4260345.51 frames. ], batch size: 371, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:41:02,966 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1881876.0, ans=0.125 2023-06-27 19:41:28,116 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1881936.0, ans=0.1 2023-06-27 19:42:15,392 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.345e+02 6.737e+02 1.063e+03 1.710e+03 3.619e+03, threshold=2.126e+03, percent-clipped=15.0 2023-06-27 19:42:35,710 INFO [train.py:996] (0/4) Epoch 11, batch 8750, loss[loss=0.2467, simple_loss=0.2935, pruned_loss=0.09992, over 21704.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.29, pruned_loss=0.06476, over 4267511.66 frames. 
], batch size: 508, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:43:52,742 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1882356.0, ans=0.0 2023-06-27 19:44:19,333 INFO [train.py:996] (0/4) Epoch 11, batch 8800, loss[loss=0.2392, simple_loss=0.3281, pruned_loss=0.07512, over 21775.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.298, pruned_loss=0.0673, over 4272577.49 frames. ], batch size: 332, lr: 2.68e-03, grad_scale: 32.0 2023-06-27 19:45:06,721 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1882596.0, ans=0.125 2023-06-27 19:45:08,350 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1882596.0, ans=0.0 2023-06-27 19:45:20,230 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1882596.0, ans=0.125 2023-06-27 19:45:33,278 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1882656.0, ans=0.125 2023-06-27 19:45:41,625 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 19:45:43,948 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.62 vs. limit=10.0 2023-06-27 19:45:49,238 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.680e+02 9.134e+02 1.413e+03 2.470e+03 4.738e+03, threshold=2.826e+03, percent-clipped=30.0 2023-06-27 19:46:02,353 INFO [train.py:996] (0/4) Epoch 11, batch 8850, loss[loss=0.2149, simple_loss=0.3081, pruned_loss=0.06083, over 21767.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.3051, pruned_loss=0.07008, over 4277939.82 frames. ], batch size: 282, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:47:39,442 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1883016.0, ans=0.125 2023-06-27 19:47:50,838 INFO [train.py:996] (0/4) Epoch 11, batch 8900, loss[loss=0.1833, simple_loss=0.2722, pruned_loss=0.04721, over 21702.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2977, pruned_loss=0.06865, over 4279837.58 frames. ], batch size: 282, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:48:51,267 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1883196.0, ans=0.125 2023-06-27 19:48:52,796 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1883196.0, ans=0.0 2023-06-27 19:48:53,430 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.72 vs. 
limit=15.0 2023-06-27 19:49:18,885 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1883316.0, ans=0.95 2023-06-27 19:49:22,057 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1883316.0, ans=0.125 2023-06-27 19:49:23,139 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.611e+02 6.392e+02 1.039e+03 1.753e+03 5.076e+03, threshold=2.078e+03, percent-clipped=8.0 2023-06-27 19:49:25,259 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1883316.0, ans=0.125 2023-06-27 19:49:32,557 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.97 vs. limit=15.0 2023-06-27 19:49:35,727 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.29 vs. limit=15.0 2023-06-27 19:49:36,286 INFO [train.py:996] (0/4) Epoch 11, batch 8950, loss[loss=0.1949, simple_loss=0.2615, pruned_loss=0.06412, over 21348.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2994, pruned_loss=0.06868, over 4273850.94 frames. ], batch size: 194, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:49:38,672 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1883376.0, ans=0.09899494936611666 2023-06-27 19:50:08,025 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1883436.0, ans=0.0 2023-06-27 19:50:18,472 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.14 vs. limit=10.0 2023-06-27 19:50:20,168 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.22 vs. limit=15.0 2023-06-27 19:50:45,039 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.73 vs. limit=22.5 2023-06-27 19:50:46,091 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1883556.0, ans=0.125 2023-06-27 19:51:07,386 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1883616.0, ans=0.125 2023-06-27 19:51:18,655 INFO [train.py:996] (0/4) Epoch 11, batch 9000, loss[loss=0.1826, simple_loss=0.2632, pruned_loss=0.05101, over 21714.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2923, pruned_loss=0.06821, over 4273700.60 frames. ], batch size: 282, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:51:18,657 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-27 19:51:37,894 INFO [train.py:1028] (0/4) Epoch 11, validation: loss=0.2621, simple_loss=0.3543, pruned_loss=0.08494, over 1796401.00 frames. 
2023-06-27 19:51:37,895 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23714MB 2023-06-27 19:51:52,741 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 19:52:16,094 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1883736.0, ans=0.0 2023-06-27 19:52:28,017 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1883796.0, ans=0.1 2023-06-27 19:52:28,545 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.26 vs. limit=15.0 2023-06-27 19:53:04,550 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.323e+02 6.298e+02 8.263e+02 1.367e+03 3.761e+03, threshold=1.653e+03, percent-clipped=12.0 2023-06-27 19:53:28,398 INFO [train.py:996] (0/4) Epoch 11, batch 9050, loss[loss=0.1615, simple_loss=0.2486, pruned_loss=0.03726, over 21729.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2883, pruned_loss=0.06528, over 4280771.71 frames. ], batch size: 247, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:53:48,139 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1883976.0, ans=0.2 2023-06-27 19:53:55,214 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1884036.0, ans=0.05 2023-06-27 19:54:18,444 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1884096.0, ans=0.0 2023-06-27 19:54:43,704 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1884156.0, ans=0.09899494936611666 2023-06-27 19:54:52,580 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.39 vs. limit=6.0 2023-06-27 19:54:53,939 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1884216.0, ans=0.125 2023-06-27 19:55:03,830 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1884216.0, ans=0.125 2023-06-27 19:55:08,862 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1884216.0, ans=0.0 2023-06-27 19:55:13,457 INFO [train.py:996] (0/4) Epoch 11, batch 9100, loss[loss=0.1875, simple_loss=0.2896, pruned_loss=0.04273, over 21721.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2948, pruned_loss=0.06744, over 4282078.93 frames. ], batch size: 298, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:56:32,396 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1884456.0, ans=0.0 2023-06-27 19:56:44,830 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.255e+02 7.145e+02 1.042e+03 1.570e+03 3.461e+03, threshold=2.085e+03, percent-clipped=19.0 2023-06-27 19:57:03,242 INFO [train.py:996] (0/4) Epoch 11, batch 9150, loss[loss=0.2404, simple_loss=0.3624, pruned_loss=0.0592, over 19809.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2978, pruned_loss=0.06504, over 4272684.08 frames. 
], batch size: 702, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:57:15,531 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1884576.0, ans=0.1 2023-06-27 19:57:17,964 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.23 vs. limit=15.0 2023-06-27 19:57:40,403 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1884696.0, ans=0.125 2023-06-27 19:57:59,036 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1884756.0, ans=0.0 2023-06-27 19:58:20,561 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1884756.0, ans=0.125 2023-06-27 19:58:44,155 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn1.whiten.whitening_limit, batch_count=1884816.0, ans=22.5 2023-06-27 19:58:44,155 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.95 vs. limit=22.5 2023-06-27 19:58:44,986 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1884876.0, ans=0.0 2023-06-27 19:58:45,956 INFO [train.py:996] (0/4) Epoch 11, batch 9200, loss[loss=0.2278, simple_loss=0.312, pruned_loss=0.07179, over 21314.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.3006, pruned_loss=0.0646, over 4273515.42 frames. ], batch size: 159, lr: 2.68e-03, grad_scale: 32.0 2023-06-27 19:58:51,746 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1884876.0, ans=0.0 2023-06-27 20:00:00,531 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1885056.0, ans=0.125 2023-06-27 20:00:16,346 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.692e+02 7.193e+02 1.189e+03 2.039e+03 4.796e+03, threshold=2.378e+03, percent-clipped=22.0 2023-06-27 20:00:28,228 INFO [train.py:996] (0/4) Epoch 11, batch 9250, loss[loss=0.2069, simple_loss=0.2837, pruned_loss=0.06507, over 21792.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.3036, pruned_loss=0.06794, over 4274443.04 frames. ], batch size: 124, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 20:00:28,944 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1885176.0, ans=0.125 2023-06-27 20:00:40,402 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1885176.0, ans=0.125 2023-06-27 20:00:55,975 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.12 vs. limit=10.0 2023-06-27 20:01:21,272 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1885296.0, ans=0.2 2023-06-27 20:02:15,114 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1885416.0, ans=0.0 2023-06-27 20:02:17,643 INFO [train.py:996] (0/4) Epoch 11, batch 9300, loss[loss=0.235, simple_loss=0.3302, pruned_loss=0.06988, over 21862.00 frames. 
], tot_loss[loss=0.2158, simple_loss=0.2968, pruned_loss=0.0674, over 4273551.87 frames. ], batch size: 372, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 20:03:27,506 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1885656.0, ans=0.125 2023-06-27 20:03:50,077 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.175e+02 5.672e+02 8.335e+02 1.329e+03 3.533e+03, threshold=1.667e+03, percent-clipped=8.0 2023-06-27 20:03:51,232 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.38 vs. limit=15.0 2023-06-27 20:03:59,529 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1885716.0, ans=0.0 2023-06-27 20:04:02,369 INFO [train.py:996] (0/4) Epoch 11, batch 9350, loss[loss=0.2421, simple_loss=0.3251, pruned_loss=0.07956, over 21603.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.3021, pruned_loss=0.06806, over 4274216.84 frames. ], batch size: 389, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 20:04:13,272 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1885776.0, ans=0.0 2023-06-27 20:04:20,081 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1885836.0, ans=0.125 2023-06-27 20:04:47,104 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.74 vs. limit=15.0 2023-06-27 20:04:56,236 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1885896.0, ans=0.2 2023-06-27 20:05:12,878 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1885956.0, ans=0.125 2023-06-27 20:05:26,701 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1885956.0, ans=0.0 2023-06-27 20:05:41,673 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1886016.0, ans=0.0 2023-06-27 20:05:45,785 INFO [train.py:996] (0/4) Epoch 11, batch 9400, loss[loss=0.2009, simple_loss=0.2727, pruned_loss=0.06458, over 21769.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.3037, pruned_loss=0.06821, over 4272816.28 frames. ], batch size: 351, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 20:07:06,148 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.17 vs. limit=15.0 2023-06-27 20:07:14,055 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.18 vs. limit=22.5 2023-06-27 20:07:16,432 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.854e+02 7.483e+02 1.060e+03 1.789e+03 3.889e+03, threshold=2.119e+03, percent-clipped=27.0 2023-06-27 20:07:17,481 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=5.68 vs. limit=22.5 2023-06-27 20:07:25,531 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=6.31 vs. 
limit=22.5 2023-06-27 20:07:27,686 INFO [train.py:996] (0/4) Epoch 11, batch 9450, loss[loss=0.2075, simple_loss=0.264, pruned_loss=0.07556, over 21412.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2951, pruned_loss=0.06694, over 4274777.84 frames. ], batch size: 475, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 20:07:31,566 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1886376.0, ans=0.0 2023-06-27 20:07:42,084 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1886376.0, ans=0.0 2023-06-27 20:07:56,868 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1886436.0, ans=0.0 2023-06-27 20:08:15,921 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=5.62 vs. limit=22.5 2023-06-27 20:08:20,362 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1886496.0, ans=0.1 2023-06-27 20:08:26,359 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.86 vs. limit=15.0 2023-06-27 20:08:36,311 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.51 vs. limit=15.0 2023-06-27 20:08:43,972 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1886556.0, ans=0.1 2023-06-27 20:08:57,372 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1886616.0, ans=0.125 2023-06-27 20:09:11,692 INFO [train.py:996] (0/4) Epoch 11, batch 9500, loss[loss=0.177, simple_loss=0.2545, pruned_loss=0.04978, over 21622.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2874, pruned_loss=0.06555, over 4279850.31 frames. ], batch size: 298, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 20:10:16,252 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1886856.0, ans=0.2 2023-06-27 20:10:19,480 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1886856.0, ans=0.0 2023-06-27 20:10:31,040 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1886856.0, ans=0.0 2023-06-27 20:10:38,451 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.105e+02 7.389e+02 1.115e+03 1.559e+03 4.093e+03, threshold=2.229e+03, percent-clipped=13.0 2023-06-27 20:10:40,754 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1886916.0, ans=0.0 2023-06-27 20:10:48,793 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1886976.0, ans=0.0 2023-06-27 20:10:49,782 INFO [train.py:996] (0/4) Epoch 11, batch 9550, loss[loss=0.2663, simple_loss=0.3455, pruned_loss=0.09359, over 21751.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2918, pruned_loss=0.06748, over 4279000.29 frames. 
], batch size: 441, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 20:10:58,941 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1886976.0, ans=0.125 2023-06-27 20:11:21,582 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1887036.0, ans=0.1 2023-06-27 20:11:48,021 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1887156.0, ans=0.0 2023-06-27 20:12:19,085 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1887216.0, ans=0.09899494936611666 2023-06-27 20:12:26,515 INFO [train.py:996] (0/4) Epoch 11, batch 9600, loss[loss=0.2086, simple_loss=0.2852, pruned_loss=0.06598, over 21888.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2942, pruned_loss=0.06861, over 4285417.13 frames. ], batch size: 107, lr: 2.68e-03, grad_scale: 32.0 2023-06-27 20:12:54,379 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.68 vs. limit=22.5 2023-06-27 20:13:07,840 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=1887336.0, ans=22.5 2023-06-27 20:13:14,160 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.69 vs. limit=15.0 2023-06-27 20:13:20,352 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1887396.0, ans=0.0 2023-06-27 20:13:50,487 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1887516.0, ans=0.125 2023-06-27 20:13:54,722 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.225e+02 7.650e+02 1.090e+03 1.713e+03 4.107e+03, threshold=2.181e+03, percent-clipped=11.0 2023-06-27 20:14:05,156 INFO [train.py:996] (0/4) Epoch 11, batch 9650, loss[loss=0.2603, simple_loss=0.3303, pruned_loss=0.09519, over 21774.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2955, pruned_loss=0.06855, over 4288498.88 frames. ], batch size: 441, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 20:14:25,375 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1887576.0, ans=0.125 2023-06-27 20:14:48,831 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1887636.0, ans=0.2 2023-06-27 20:15:19,118 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1887756.0, ans=10.0 2023-06-27 20:15:53,494 INFO [train.py:996] (0/4) Epoch 11, batch 9700, loss[loss=0.211, simple_loss=0.2842, pruned_loss=0.06889, over 21849.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2996, pruned_loss=0.06954, over 4282007.21 frames. ], batch size: 107, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 20:17:20,714 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.740e+02 5.952e+02 8.386e+02 1.192e+03 2.882e+03, threshold=1.677e+03, percent-clipped=3.0 2023-06-27 20:17:35,535 INFO [train.py:996] (0/4) Epoch 11, batch 9750, loss[loss=0.2685, simple_loss=0.3449, pruned_loss=0.09611, over 16777.00 frames. 
], tot_loss[loss=0.2153, simple_loss=0.2933, pruned_loss=0.06868, over 4263453.27 frames. ], batch size: 68, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 20:18:28,056 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.64 vs. limit=15.0 2023-06-27 20:18:29,096 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1888296.0, ans=0.125 2023-06-27 20:19:06,527 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1888416.0, ans=0.125 2023-06-27 20:19:10,798 INFO [train.py:996] (0/4) Epoch 11, batch 9800, loss[loss=0.2046, simple_loss=0.2851, pruned_loss=0.0621, over 21739.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2932, pruned_loss=0.06824, over 4261762.07 frames. ], batch size: 389, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 20:19:19,772 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1888476.0, ans=0.125 2023-06-27 20:20:25,085 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1888656.0, ans=0.125 2023-06-27 20:20:42,518 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.393e+02 6.342e+02 8.538e+02 1.222e+03 6.218e+03, threshold=1.708e+03, percent-clipped=13.0 2023-06-27 20:20:52,441 INFO [train.py:996] (0/4) Epoch 11, batch 9850, loss[loss=0.1967, simple_loss=0.2678, pruned_loss=0.06284, over 21785.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.29, pruned_loss=0.06827, over 4266235.98 frames. ], batch size: 371, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 20:21:30,504 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1888836.0, ans=0.1 2023-06-27 20:21:44,561 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.13 vs. limit=15.0 2023-06-27 20:22:31,153 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1889016.0, ans=0.125 2023-06-27 20:22:35,619 INFO [train.py:996] (0/4) Epoch 11, batch 9900, loss[loss=0.235, simple_loss=0.3084, pruned_loss=0.08076, over 21259.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2857, pruned_loss=0.06727, over 4259096.32 frames. 
], batch size: 159, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 20:22:39,513 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1889076.0, ans=0.035 2023-06-27 20:23:02,130 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1889076.0, ans=0.125 2023-06-27 20:23:10,587 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1889136.0, ans=0.125 2023-06-27 20:23:14,147 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1889136.0, ans=0.125 2023-06-27 20:23:19,547 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1889136.0, ans=0.0 2023-06-27 20:23:39,052 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1889196.0, ans=0.0 2023-06-27 20:23:44,135 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=1889256.0, ans=0.95 2023-06-27 20:23:47,550 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1889256.0, ans=0.2 2023-06-27 20:24:00,439 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1889316.0, ans=0.125 2023-06-27 20:24:07,972 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.793e+02 7.874e+02 1.115e+03 1.655e+03 5.340e+03, threshold=2.230e+03, percent-clipped=22.0 2023-06-27 20:24:18,299 INFO [train.py:996] (0/4) Epoch 11, batch 9950, loss[loss=0.2227, simple_loss=0.2905, pruned_loss=0.07748, over 21863.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2862, pruned_loss=0.0695, over 4261913.15 frames. ], batch size: 98, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 20:24:55,588 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1889436.0, ans=0.1 2023-06-27 20:25:27,643 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1889556.0, ans=0.125 2023-06-27 20:25:51,472 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1889616.0, ans=0.125 2023-06-27 20:25:52,956 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1889616.0, ans=0.125 2023-06-27 20:26:16,675 INFO [train.py:996] (0/4) Epoch 11, batch 10000, loss[loss=0.2159, simple_loss=0.2866, pruned_loss=0.0726, over 21834.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.2823, pruned_loss=0.06871, over 4256004.74 frames. ], batch size: 118, lr: 2.67e-03, grad_scale: 32.0 2023-06-27 20:26:27,527 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1889676.0, ans=0.1 2023-06-27 20:26:32,772 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1889736.0, ans=0.0 2023-06-27 20:27:09,162 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.92 vs. 
limit=22.5 2023-06-27 20:27:11,820 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1889856.0, ans=0.125 2023-06-27 20:27:12,497 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.95 vs. limit=15.0 2023-06-27 20:27:25,917 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1889856.0, ans=0.125 2023-06-27 20:27:52,691 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.633e+02 6.432e+02 1.028e+03 1.503e+03 2.874e+03, threshold=2.056e+03, percent-clipped=5.0 2023-06-27 20:28:01,329 INFO [train.py:996] (0/4) Epoch 11, batch 10050, loss[loss=0.2634, simple_loss=0.3291, pruned_loss=0.09888, over 21745.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2845, pruned_loss=0.06888, over 4262628.01 frames. ], batch size: 441, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 20:28:05,406 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1889976.0, ans=0.2 2023-06-27 20:28:07,351 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1889976.0, ans=0.125 2023-06-27 20:29:04,750 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1890156.0, ans=0.2 2023-06-27 20:29:11,254 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1890156.0, ans=0.125 2023-06-27 20:29:19,171 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1890156.0, ans=0.1 2023-06-27 20:29:31,059 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.74 vs. limit=15.0 2023-06-27 20:29:41,979 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1890216.0, ans=0.0 2023-06-27 20:29:44,813 INFO [train.py:996] (0/4) Epoch 11, batch 10100, loss[loss=0.1826, simple_loss=0.2505, pruned_loss=0.05731, over 21315.00 frames. ], tot_loss[loss=0.2091, simple_loss=0.2835, pruned_loss=0.06733, over 4263626.89 frames. ], batch size: 194, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 20:30:13,620 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1890336.0, ans=0.5 2023-06-27 20:30:35,777 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1890396.0, ans=0.09899494936611666 2023-06-27 20:31:03,462 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1890456.0, ans=0.125 2023-06-27 20:31:19,822 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.253e+02 6.464e+02 9.043e+02 1.577e+03 3.572e+03, threshold=1.809e+03, percent-clipped=15.0 2023-06-27 20:31:28,337 INFO [train.py:996] (0/4) Epoch 11, batch 10150, loss[loss=0.2004, simple_loss=0.2811, pruned_loss=0.05988, over 21726.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2884, pruned_loss=0.06845, over 4250276.17 frames. 
], batch size: 282, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 20:31:32,292 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1890576.0, ans=0.0 2023-06-27 20:31:47,610 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1890576.0, ans=0.125 2023-06-27 20:32:43,565 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1890756.0, ans=0.0 2023-06-27 20:32:53,660 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1890816.0, ans=0.125 2023-06-27 20:32:58,658 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1890816.0, ans=0.0 2023-06-27 20:33:06,433 INFO [train.py:996] (0/4) Epoch 11, batch 10200, loss[loss=0.2053, simple_loss=0.2954, pruned_loss=0.05758, over 21709.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2878, pruned_loss=0.06681, over 4246393.82 frames. ], batch size: 351, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 20:33:07,589 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.82 vs. limit=15.0 2023-06-27 20:34:41,109 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.060e+02 5.938e+02 9.160e+02 1.393e+03 3.097e+03, threshold=1.832e+03, percent-clipped=16.0 2023-06-27 20:34:49,775 INFO [train.py:996] (0/4) Epoch 11, batch 10250, loss[loss=0.2395, simple_loss=0.3482, pruned_loss=0.06543, over 19968.00 frames. ], tot_loss[loss=0.2031, simple_loss=0.2833, pruned_loss=0.06149, over 4258522.36 frames. ], batch size: 703, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 20:35:00,321 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1891176.0, ans=0.0 2023-06-27 20:35:20,558 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1891236.0, ans=0.0 2023-06-27 20:36:08,604 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1891356.0, ans=0.1 2023-06-27 20:36:17,815 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.70 vs. limit=15.0 2023-06-27 20:36:22,417 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1891416.0, ans=0.125 2023-06-27 20:36:24,096 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1891416.0, ans=0.1 2023-06-27 20:36:24,573 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.19 vs. limit=15.0 2023-06-27 20:36:38,667 INFO [train.py:996] (0/4) Epoch 11, batch 10300, loss[loss=0.2017, simple_loss=0.2975, pruned_loss=0.0529, over 21414.00 frames. ], tot_loss[loss=0.2054, simple_loss=0.2865, pruned_loss=0.06209, over 4265663.57 frames. 
], batch size: 211, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 20:36:57,634 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1891536.0, ans=0.1 2023-06-27 20:36:57,693 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1891536.0, ans=0.07 2023-06-27 20:37:39,325 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1891596.0, ans=0.125 2023-06-27 20:37:39,928 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.04 vs. limit=15.0 2023-06-27 20:37:55,173 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.80 vs. limit=15.0 2023-06-27 20:38:14,285 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.623e+02 8.106e+02 1.179e+03 1.696e+03 3.317e+03, threshold=2.359e+03, percent-clipped=22.0 2023-06-27 20:38:17,438 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.58 vs. limit=6.0 2023-06-27 20:38:21,863 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1891776.0, ans=0.125 2023-06-27 20:38:22,837 INFO [train.py:996] (0/4) Epoch 11, batch 10350, loss[loss=0.2804, simple_loss=0.3482, pruned_loss=0.1063, over 21420.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.2899, pruned_loss=0.06338, over 4270867.05 frames. ], batch size: 507, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 20:38:35,483 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1891776.0, ans=0.2 2023-06-27 20:39:08,028 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1891896.0, ans=0.125 2023-06-27 20:39:48,087 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1892016.0, ans=0.125 2023-06-27 20:40:03,135 INFO [train.py:996] (0/4) Epoch 11, batch 10400, loss[loss=0.1386, simple_loss=0.1808, pruned_loss=0.04814, over 21716.00 frames. ], tot_loss[loss=0.2052, simple_loss=0.2839, pruned_loss=0.06323, over 4263283.93 frames. ], batch size: 112, lr: 2.67e-03, grad_scale: 32.0 2023-06-27 20:40:41,128 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1892136.0, ans=0.1 2023-06-27 20:41:36,685 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.431e+02 7.002e+02 1.054e+03 1.542e+03 5.604e+03, threshold=2.109e+03, percent-clipped=11.0 2023-06-27 20:41:43,657 INFO [train.py:996] (0/4) Epoch 11, batch 10450, loss[loss=0.2034, simple_loss=0.2853, pruned_loss=0.06079, over 21458.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.287, pruned_loss=0.06504, over 4259575.06 frames. 
], batch size: 211, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 20:42:27,781 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 20:42:41,111 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1892496.0, ans=0.0 2023-06-27 20:43:07,194 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1892556.0, ans=0.0 2023-06-27 20:43:34,408 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1892676.0, ans=0.125 2023-06-27 20:43:35,359 INFO [train.py:996] (0/4) Epoch 11, batch 10500, loss[loss=0.1731, simple_loss=0.2408, pruned_loss=0.05271, over 21504.00 frames. ], tot_loss[loss=0.2062, simple_loss=0.2862, pruned_loss=0.06307, over 4252640.23 frames. ], batch size: 212, lr: 2.67e-03, grad_scale: 8.0 2023-06-27 20:43:52,244 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 20:44:23,860 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.48 vs. limit=15.0 2023-06-27 20:45:06,708 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.280e+02 6.335e+02 9.197e+02 1.411e+03 2.954e+03, threshold=1.839e+03, percent-clipped=7.0 2023-06-27 20:45:11,717 INFO [train.py:996] (0/4) Epoch 11, batch 10550, loss[loss=0.1676, simple_loss=0.2334, pruned_loss=0.05087, over 21619.00 frames. ], tot_loss[loss=0.2025, simple_loss=0.2796, pruned_loss=0.06269, over 4244457.18 frames. ], batch size: 231, lr: 2.67e-03, grad_scale: 8.0 2023-06-27 20:45:33,262 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1892976.0, ans=0.125 2023-06-27 20:45:35,690 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.05 vs. limit=10.0 2023-06-27 20:45:55,224 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1893096.0, ans=0.125 2023-06-27 20:45:58,668 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1893096.0, ans=0.0 2023-06-27 20:46:59,854 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.07 vs. limit=22.5 2023-06-27 20:47:00,293 INFO [train.py:996] (0/4) Epoch 11, batch 10600, loss[loss=0.2195, simple_loss=0.3175, pruned_loss=0.06073, over 21475.00 frames. ], tot_loss[loss=0.2003, simple_loss=0.2763, pruned_loss=0.06215, over 4239818.74 frames. ], batch size: 471, lr: 2.67e-03, grad_scale: 8.0 2023-06-27 20:47:21,861 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1893336.0, ans=0.2 2023-06-27 20:48:45,439 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.137e+02 6.706e+02 1.104e+03 1.395e+03 2.716e+03, threshold=2.208e+03, percent-clipped=10.0 2023-06-27 20:48:50,921 INFO [train.py:996] (0/4) Epoch 11, batch 10650, loss[loss=0.2033, simple_loss=0.2857, pruned_loss=0.06046, over 21672.00 frames. ], tot_loss[loss=0.1998, simple_loss=0.2784, pruned_loss=0.06062, over 4247128.76 frames. 
], batch size: 414, lr: 2.67e-03, grad_scale: 8.0 2023-06-27 20:48:55,885 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.21 vs. limit=15.0 2023-06-27 20:49:02,436 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1893576.0, ans=0.125 2023-06-27 20:49:15,757 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1893636.0, ans=0.0 2023-06-27 20:49:29,392 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1893696.0, ans=0.125 2023-06-27 20:49:54,607 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1893756.0, ans=0.125 2023-06-27 20:50:19,987 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1893816.0, ans=0.0 2023-06-27 20:50:31,091 INFO [train.py:996] (0/4) Epoch 11, batch 10700, loss[loss=0.1672, simple_loss=0.2421, pruned_loss=0.04614, over 21447.00 frames. ], tot_loss[loss=0.1986, simple_loss=0.2767, pruned_loss=0.06028, over 4232987.50 frames. ], batch size: 212, lr: 2.67e-03, grad_scale: 8.0 2023-06-27 20:50:31,705 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1893876.0, ans=0.2 2023-06-27 20:50:35,351 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1893876.0, ans=0.1 2023-06-27 20:50:37,276 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=1893876.0, ans=22.5 2023-06-27 20:51:16,916 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1893996.0, ans=0.125 2023-06-27 20:51:26,765 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1893996.0, ans=0.0 2023-06-27 20:51:30,158 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1894056.0, ans=0.0 2023-06-27 20:51:52,996 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 20:52:10,064 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.693e+02 7.296e+02 1.063e+03 2.005e+03 4.294e+03, threshold=2.126e+03, percent-clipped=18.0 2023-06-27 20:52:14,968 INFO [train.py:996] (0/4) Epoch 11, batch 10750, loss[loss=0.2493, simple_loss=0.3583, pruned_loss=0.07022, over 21340.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.2876, pruned_loss=0.06459, over 4239351.82 frames. 
], batch size: 548, lr: 2.67e-03, grad_scale: 8.0 2023-06-27 20:52:48,351 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1894236.0, ans=10.0 2023-06-27 20:53:41,785 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1894416.0, ans=0.125 2023-06-27 20:53:43,607 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1894416.0, ans=10.0 2023-06-27 20:54:00,261 INFO [train.py:996] (0/4) Epoch 11, batch 10800, loss[loss=0.2422, simple_loss=0.3179, pruned_loss=0.08332, over 21723.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2922, pruned_loss=0.06546, over 4244301.54 frames. ], batch size: 298, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 20:54:34,831 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1894536.0, ans=0.0 2023-06-27 20:54:34,862 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1894536.0, ans=0.125 2023-06-27 20:54:59,086 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1894596.0, ans=0.0 2023-06-27 20:55:05,596 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1894596.0, ans=0.125 2023-06-27 20:55:27,201 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1894716.0, ans=0.125 2023-06-27 20:55:38,140 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.255e+02 6.646e+02 1.015e+03 1.682e+03 4.029e+03, threshold=2.031e+03, percent-clipped=15.0 2023-06-27 20:55:38,792 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1894716.0, ans=0.0 2023-06-27 20:55:42,109 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1894776.0, ans=0.1 2023-06-27 20:55:43,159 INFO [train.py:996] (0/4) Epoch 11, batch 10850, loss[loss=0.1907, simple_loss=0.2594, pruned_loss=0.061, over 21384.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2954, pruned_loss=0.06731, over 4255974.14 frames. 
], batch size: 194, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 20:56:00,447 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1894776.0, ans=0.0 2023-06-27 20:56:38,581 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1894896.0, ans=0.125 2023-06-27 20:56:46,920 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1894896.0, ans=0.0 2023-06-27 20:56:50,352 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1894896.0, ans=0.125 2023-06-27 20:57:04,324 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1894956.0, ans=0.07 2023-06-27 20:57:23,721 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1895016.0, ans=0.0 2023-06-27 20:57:27,850 INFO [train.py:996] (0/4) Epoch 11, batch 10900, loss[loss=0.1915, simple_loss=0.2776, pruned_loss=0.05267, over 21569.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2888, pruned_loss=0.0654, over 4252942.51 frames. ], batch size: 230, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 20:58:27,501 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1895196.0, ans=0.0 2023-06-27 20:58:59,323 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.887e+02 5.631e+02 8.269e+02 1.201e+03 2.087e+03, threshold=1.654e+03, percent-clipped=2.0 2023-06-27 20:59:04,293 INFO [train.py:996] (0/4) Epoch 11, batch 10950, loss[loss=0.1827, simple_loss=0.2602, pruned_loss=0.05262, over 21939.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.284, pruned_loss=0.06346, over 4256122.48 frames. ], batch size: 125, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 20:59:24,904 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.53 vs. limit=15.0 2023-06-27 20:59:26,204 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1895376.0, ans=0.0 2023-06-27 20:59:36,578 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1895436.0, ans=0.125 2023-06-27 20:59:41,337 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1895436.0, ans=0.1 2023-06-27 20:59:58,671 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.86 vs. limit=15.0 2023-06-27 21:00:05,198 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.08 vs. limit=10.0 2023-06-27 21:00:51,815 INFO [train.py:996] (0/4) Epoch 11, batch 11000, loss[loss=0.2089, simple_loss=0.2802, pruned_loss=0.06882, over 19987.00 frames. ], tot_loss[loss=0.2053, simple_loss=0.2828, pruned_loss=0.06388, over 4256905.53 frames. 
], batch size: 703, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:01:18,779 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1895736.0, ans=0.125 2023-06-27 21:01:40,063 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.84 vs. limit=15.0 2023-06-27 21:01:49,047 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1895796.0, ans=0.1 2023-06-27 21:01:52,401 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1895796.0, ans=0.125 2023-06-27 21:02:06,340 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.20 vs. limit=8.0 2023-06-27 21:02:12,060 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 21:02:12,617 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.67 vs. limit=15.0 2023-06-27 21:02:24,707 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.482e+02 6.416e+02 9.382e+02 1.390e+03 3.598e+03, threshold=1.876e+03, percent-clipped=17.0 2023-06-27 21:02:28,501 INFO [train.py:996] (0/4) Epoch 11, batch 11050, loss[loss=0.1804, simple_loss=0.2482, pruned_loss=0.05631, over 21577.00 frames. ], tot_loss[loss=0.205, simple_loss=0.28, pruned_loss=0.06495, over 4272365.49 frames. ], batch size: 247, lr: 2.67e-03, grad_scale: 8.0 2023-06-27 21:02:38,580 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-316000.pt 2023-06-27 21:02:57,005 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1896036.0, ans=0.125 2023-06-27 21:03:33,146 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1896096.0, ans=0.2 2023-06-27 21:04:16,378 INFO [train.py:996] (0/4) Epoch 11, batch 11100, loss[loss=0.245, simple_loss=0.3276, pruned_loss=0.08119, over 21429.00 frames. ], tot_loss[loss=0.2043, simple_loss=0.2784, pruned_loss=0.06511, over 4265649.53 frames. ], batch size: 471, lr: 2.67e-03, grad_scale: 8.0 2023-06-27 21:04:25,275 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1896276.0, ans=0.0 2023-06-27 21:04:45,717 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.88 vs. limit=15.0 2023-06-27 21:05:33,593 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.28 vs. limit=15.0 2023-06-27 21:05:43,324 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1896516.0, ans=0.125 2023-06-27 21:05:55,706 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.559e+02 5.976e+02 8.582e+02 1.481e+03 2.937e+03, threshold=1.716e+03, percent-clipped=16.0 2023-06-27 21:05:58,959 INFO [train.py:996] (0/4) Epoch 11, batch 11150, loss[loss=0.2247, simple_loss=0.3216, pruned_loss=0.06391, over 21618.00 frames. 
], tot_loss[loss=0.2036, simple_loss=0.2772, pruned_loss=0.06497, over 4270461.99 frames. ], batch size: 414, lr: 2.67e-03, grad_scale: 8.0 2023-06-27 21:07:17,209 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.07 vs. limit=6.0 2023-06-27 21:07:42,371 INFO [train.py:996] (0/4) Epoch 11, batch 11200, loss[loss=0.1797, simple_loss=0.2577, pruned_loss=0.05086, over 21171.00 frames. ], tot_loss[loss=0.2031, simple_loss=0.2762, pruned_loss=0.06501, over 4264683.17 frames. ], batch size: 159, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:07:46,677 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1896876.0, ans=0.1 2023-06-27 21:08:12,888 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1896936.0, ans=0.125 2023-06-27 21:08:38,201 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.22 vs. limit=15.0 2023-06-27 21:08:54,142 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1897056.0, ans=0.0 2023-06-27 21:09:20,938 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.297e+02 6.349e+02 8.470e+02 1.226e+03 2.540e+03, threshold=1.694e+03, percent-clipped=7.0 2023-06-27 21:09:24,617 INFO [train.py:996] (0/4) Epoch 11, batch 11250, loss[loss=0.2047, simple_loss=0.2877, pruned_loss=0.06087, over 21734.00 frames. ], tot_loss[loss=0.2031, simple_loss=0.2761, pruned_loss=0.065, over 4268862.55 frames. ], batch size: 391, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:09:29,229 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.46 vs. limit=22.5 2023-06-27 21:10:07,234 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1897236.0, ans=0.2 2023-06-27 21:10:33,733 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1897356.0, ans=0.2 2023-06-27 21:10:50,819 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1897416.0, ans=0.125 2023-06-27 21:11:06,515 INFO [train.py:996] (0/4) Epoch 11, batch 11300, loss[loss=0.1988, simple_loss=0.2755, pruned_loss=0.06105, over 21637.00 frames. ], tot_loss[loss=0.2028, simple_loss=0.2762, pruned_loss=0.06471, over 4265777.54 frames. 
], batch size: 263, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:11:10,308 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=1897476.0, ans=10.0 2023-06-27 21:12:02,931 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1897596.0, ans=0.0 2023-06-27 21:12:32,639 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1897716.0, ans=0.0 2023-06-27 21:12:44,922 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.908e+02 6.761e+02 9.436e+02 1.469e+03 2.612e+03, threshold=1.887e+03, percent-clipped=16.0 2023-06-27 21:12:48,344 INFO [train.py:996] (0/4) Epoch 11, batch 11350, loss[loss=0.1905, simple_loss=0.2718, pruned_loss=0.0546, over 21244.00 frames. ], tot_loss[loss=0.2029, simple_loss=0.2777, pruned_loss=0.06407, over 4268712.22 frames. ], batch size: 159, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:12:52,311 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1897776.0, ans=0.2 2023-06-27 21:13:02,394 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1897776.0, ans=0.1 2023-06-27 21:13:14,349 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.30 vs. limit=15.0 2023-06-27 21:13:46,609 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1897896.0, ans=0.0 2023-06-27 21:14:18,356 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1898016.0, ans=0.125 2023-06-27 21:14:30,956 INFO [train.py:996] (0/4) Epoch 11, batch 11400, loss[loss=0.205, simple_loss=0.2994, pruned_loss=0.05527, over 21725.00 frames. ], tot_loss[loss=0.2077, simple_loss=0.2839, pruned_loss=0.06573, over 4264904.27 frames. ], batch size: 332, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:14:47,580 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1898076.0, ans=0.1 2023-06-27 21:15:08,979 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1898136.0, ans=0.0 2023-06-27 21:15:57,143 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1898316.0, ans=0.0 2023-06-27 21:16:09,914 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.866e+02 7.031e+02 1.006e+03 1.495e+03 2.656e+03, threshold=2.011e+03, percent-clipped=10.0 2023-06-27 21:16:23,426 INFO [train.py:996] (0/4) Epoch 11, batch 11450, loss[loss=0.1972, simple_loss=0.2839, pruned_loss=0.05526, over 21740.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.2853, pruned_loss=0.06422, over 4275665.75 frames. 
], batch size: 332, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:16:34,022 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1898376.0, ans=0.0 2023-06-27 21:16:49,195 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1898436.0, ans=0.2 2023-06-27 21:16:54,171 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1898436.0, ans=0.0 2023-06-27 21:17:17,969 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1898556.0, ans=0.125 2023-06-27 21:17:57,567 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1898616.0, ans=0.2 2023-06-27 21:18:06,596 INFO [train.py:996] (0/4) Epoch 11, batch 11500, loss[loss=0.2253, simple_loss=0.3204, pruned_loss=0.06515, over 21668.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2901, pruned_loss=0.06614, over 4280218.65 frames. ], batch size: 441, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:18:18,853 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1898676.0, ans=0.1 2023-06-27 21:18:30,707 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1898736.0, ans=0.125 2023-06-27 21:18:49,386 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1898796.0, ans=0.1 2023-06-27 21:19:10,500 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=9.07 vs. limit=15.0 2023-06-27 21:19:48,639 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.762e+02 6.937e+02 1.166e+03 1.634e+03 3.269e+03, threshold=2.333e+03, percent-clipped=13.0 2023-06-27 21:19:52,367 INFO [train.py:996] (0/4) Epoch 11, batch 11550, loss[loss=0.2186, simple_loss=0.313, pruned_loss=0.06211, over 21697.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2967, pruned_loss=0.0666, over 4281838.84 frames. ], batch size: 247, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:20:21,616 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1899036.0, ans=0.125 2023-06-27 21:20:39,889 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.48 vs. limit=15.0 2023-06-27 21:21:10,366 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1899156.0, ans=0.125 2023-06-27 21:21:18,701 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1899156.0, ans=0.0 2023-06-27 21:21:25,328 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1899216.0, ans=0.2 2023-06-27 21:21:28,736 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1899216.0, ans=0.0 2023-06-27 21:21:38,050 INFO [train.py:996] (0/4) Epoch 11, batch 11600, loss[loss=0.2252, simple_loss=0.3317, pruned_loss=0.05939, over 21673.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.3112, pruned_loss=0.06888, over 4282455.11 frames. 
], batch size: 298, lr: 2.67e-03, grad_scale: 32.0 2023-06-27 21:21:46,871 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1899276.0, ans=0.1 2023-06-27 21:22:24,550 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1899396.0, ans=0.0 2023-06-27 21:22:39,327 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1899396.0, ans=0.125 2023-06-27 21:22:45,926 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1899456.0, ans=0.125 2023-06-27 21:22:52,711 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1899456.0, ans=0.0 2023-06-27 21:22:59,245 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1899516.0, ans=0.0 2023-06-27 21:23:15,068 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.239e+02 7.630e+02 1.389e+03 2.274e+03 4.713e+03, threshold=2.778e+03, percent-clipped=21.0 2023-06-27 21:23:16,791 INFO [train.py:996] (0/4) Epoch 11, batch 11650, loss[loss=0.1955, simple_loss=0.2792, pruned_loss=0.05594, over 21623.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3181, pruned_loss=0.07056, over 4282643.18 frames. ], batch size: 298, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:24:32,825 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1899756.0, ans=0.125 2023-06-27 21:24:41,151 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1899816.0, ans=0.125 2023-06-27 21:24:49,323 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1899816.0, ans=0.0 2023-06-27 21:24:53,626 INFO [train.py:996] (0/4) Epoch 11, batch 11700, loss[loss=0.1987, simple_loss=0.2669, pruned_loss=0.06519, over 21860.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.309, pruned_loss=0.06938, over 4279510.87 frames. ], batch size: 107, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:25:37,213 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1899996.0, ans=0.04949747468305833 2023-06-27 21:26:19,553 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.71 vs. limit=22.5 2023-06-27 21:26:25,872 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1900116.0, ans=0.125 2023-06-27 21:26:28,327 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.521e+02 7.314e+02 1.090e+03 1.615e+03 2.478e+03, threshold=2.180e+03, percent-clipped=0.0 2023-06-27 21:26:29,959 INFO [train.py:996] (0/4) Epoch 11, batch 11750, loss[loss=0.1978, simple_loss=0.2739, pruned_loss=0.06089, over 21673.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2993, pruned_loss=0.06877, over 4278668.52 frames. ], batch size: 247, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:26:37,872 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.17 vs. 
limit=6.0 2023-06-27 21:26:49,086 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1900236.0, ans=0.07 2023-06-27 21:27:04,274 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1900236.0, ans=0.125 2023-06-27 21:27:34,633 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1900356.0, ans=0.0 2023-06-27 21:28:07,507 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1900476.0, ans=0.125 2023-06-27 21:28:08,579 INFO [train.py:996] (0/4) Epoch 11, batch 11800, loss[loss=0.2263, simple_loss=0.2988, pruned_loss=0.07689, over 21287.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2997, pruned_loss=0.07015, over 4276078.28 frames. ], batch size: 176, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:28:52,283 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1900596.0, ans=0.2 2023-06-27 21:29:43,928 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1900716.0, ans=0.125 2023-06-27 21:29:44,971 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.871e+02 7.084e+02 9.791e+02 1.465e+03 2.454e+03, threshold=1.958e+03, percent-clipped=4.0 2023-06-27 21:29:46,622 INFO [train.py:996] (0/4) Epoch 11, batch 11850, loss[loss=0.1811, simple_loss=0.2731, pruned_loss=0.04459, over 21323.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.3011, pruned_loss=0.06908, over 4281387.29 frames. ], batch size: 176, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:29:50,825 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1900776.0, ans=0.1 2023-06-27 21:30:59,230 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1900956.0, ans=0.125 2023-06-27 21:31:02,770 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1900956.0, ans=0.0 2023-06-27 21:31:17,755 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1901016.0, ans=0.125 2023-06-27 21:31:25,867 INFO [train.py:996] (0/4) Epoch 11, batch 11900, loss[loss=0.2428, simple_loss=0.326, pruned_loss=0.07976, over 21440.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.3024, pruned_loss=0.06747, over 4279235.50 frames. ], batch size: 507, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:31:57,165 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1901136.0, ans=0.125 2023-06-27 21:32:19,699 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1901196.0, ans=0.125 2023-06-27 21:32:35,480 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1901256.0, ans=0.2 2023-06-27 21:33:08,450 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.313e+02 5.634e+02 7.548e+02 1.171e+03 3.128e+03, threshold=1.510e+03, percent-clipped=7.0 2023-06-27 21:33:14,927 INFO [train.py:996] (0/4) Epoch 11, batch 11950, loss[loss=0.1691, simple_loss=0.2522, pruned_loss=0.04304, over 21276.00 frames. 
], tot_loss[loss=0.2172, simple_loss=0.3039, pruned_loss=0.06527, over 4271056.47 frames. ], batch size: 176, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:33:41,840 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1901436.0, ans=0.125 2023-06-27 21:34:52,071 INFO [train.py:996] (0/4) Epoch 11, batch 12000, loss[loss=0.1895, simple_loss=0.2603, pruned_loss=0.05934, over 15899.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.3006, pruned_loss=0.06431, over 4263081.61 frames. ], batch size: 60, lr: 2.67e-03, grad_scale: 32.0 2023-06-27 21:34:52,073 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-27 21:35:06,223 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.4.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.6341, 2.2043, 3.3560, 2.1865], device='cuda:0') 2023-06-27 21:35:11,388 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.1.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([4.6728, 2.1417, 4.0471, 2.2617], device='cuda:0') 2023-06-27 21:35:12,134 INFO [train.py:1028] (0/4) Epoch 11, validation: loss=0.2616, simple_loss=0.3513, pruned_loss=0.08594, over 1796401.00 frames. 2023-06-27 21:35:12,135 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23714MB 2023-06-27 21:36:59,869 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.508e+02 6.629e+02 1.038e+03 1.682e+03 4.454e+03, threshold=2.077e+03, percent-clipped=31.0 2023-06-27 21:36:59,900 INFO [train.py:996] (0/4) Epoch 11, batch 12050, loss[loss=0.2013, simple_loss=0.2724, pruned_loss=0.06514, over 21564.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2957, pruned_loss=0.06539, over 4269285.50 frames. ], batch size: 212, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:37:20,555 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1902036.0, ans=0.1 2023-06-27 21:37:30,980 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1902036.0, ans=0.125 2023-06-27 21:37:49,108 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1902096.0, ans=0.125 2023-06-27 21:38:43,461 INFO [train.py:996] (0/4) Epoch 11, batch 12100, loss[loss=0.295, simple_loss=0.355, pruned_loss=0.1175, over 21440.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2979, pruned_loss=0.06907, over 4276255.82 frames. ], batch size: 471, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:38:58,040 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1902276.0, ans=0.125 2023-06-27 21:40:26,822 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.09 vs. limit=15.0 2023-06-27 21:40:29,113 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.049e+02 8.933e+02 1.358e+03 2.118e+03 4.417e+03, threshold=2.716e+03, percent-clipped=26.0 2023-06-27 21:40:29,143 INFO [train.py:996] (0/4) Epoch 11, batch 12150, loss[loss=0.2126, simple_loss=0.3107, pruned_loss=0.0572, over 21719.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2989, pruned_loss=0.06789, over 4267632.37 frames. 
], batch size: 298, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:40:29,872 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1902576.0, ans=0.1 2023-06-27 21:40:43,674 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.29 vs. limit=15.0 2023-06-27 21:40:53,553 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1902636.0, ans=0.125 2023-06-27 21:42:09,512 INFO [train.py:996] (0/4) Epoch 11, batch 12200, loss[loss=0.1814, simple_loss=0.2598, pruned_loss=0.05144, over 21590.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2976, pruned_loss=0.06662, over 4264374.08 frames. ], batch size: 263, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 21:42:38,556 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1902936.0, ans=0.125 2023-06-27 21:42:51,861 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1902996.0, ans=6.0 2023-06-27 21:43:29,992 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1903056.0, ans=0.125 2023-06-27 21:43:34,781 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1903116.0, ans=0.125 2023-06-27 21:43:50,746 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.836e+02 6.388e+02 1.060e+03 1.811e+03 4.082e+03, threshold=2.119e+03, percent-clipped=7.0 2023-06-27 21:43:50,793 INFO [train.py:996] (0/4) Epoch 11, batch 12250, loss[loss=0.1434, simple_loss=0.2209, pruned_loss=0.03297, over 16372.00 frames. ], tot_loss[loss=0.2074, simple_loss=0.2884, pruned_loss=0.06323, over 4255457.40 frames. ], batch size: 61, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 21:44:01,511 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1903176.0, ans=0.0 2023-06-27 21:44:12,360 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1903236.0, ans=0.1 2023-06-27 21:44:50,160 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1903356.0, ans=0.125 2023-06-27 21:45:34,144 INFO [train.py:996] (0/4) Epoch 11, batch 12300, loss[loss=0.1889, simple_loss=0.2853, pruned_loss=0.04624, over 21765.00 frames. ], tot_loss[loss=0.2002, simple_loss=0.2821, pruned_loss=0.05911, over 4251040.44 frames. ], batch size: 298, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 21:46:05,911 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.40 vs. limit=12.0 2023-06-27 21:46:49,570 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1903656.0, ans=0.125 2023-06-27 21:47:16,524 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.514e+02 6.456e+02 1.093e+03 1.764e+03 5.046e+03, threshold=2.186e+03, percent-clipped=16.0 2023-06-27 21:47:16,570 INFO [train.py:996] (0/4) Epoch 11, batch 12350, loss[loss=0.2596, simple_loss=0.3353, pruned_loss=0.09192, over 21776.00 frames. 
], tot_loss[loss=0.2031, simple_loss=0.2864, pruned_loss=0.05986, over 4251097.35 frames. ], batch size: 441, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 21:47:31,883 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1903836.0, ans=0.125 2023-06-27 21:48:06,218 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 21:48:57,241 INFO [train.py:996] (0/4) Epoch 11, batch 12400, loss[loss=0.2084, simple_loss=0.2856, pruned_loss=0.06566, over 21736.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2887, pruned_loss=0.06328, over 4260949.15 frames. ], batch size: 389, lr: 2.66e-03, grad_scale: 32.0 2023-06-27 21:49:09,395 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1904076.0, ans=0.09899494936611666 2023-06-27 21:49:55,706 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1904256.0, ans=0.0 2023-06-27 21:50:30,019 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1904316.0, ans=0.125 2023-06-27 21:50:33,372 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1904316.0, ans=0.125 2023-06-27 21:50:39,837 INFO [train.py:996] (0/4) Epoch 11, batch 12450, loss[loss=0.2719, simple_loss=0.3481, pruned_loss=0.09784, over 21831.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2931, pruned_loss=0.0666, over 4275286.31 frames. ], batch size: 124, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 21:50:40,478 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1904376.0, ans=0.2 2023-06-27 21:50:41,639 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.549e+02 6.531e+02 8.502e+02 1.313e+03 3.916e+03, threshold=1.700e+03, percent-clipped=4.0 2023-06-27 21:50:42,985 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.27 vs. limit=15.0 2023-06-27 21:50:56,049 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1904376.0, ans=0.0 2023-06-27 21:51:41,682 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.60 vs. limit=12.0 2023-06-27 21:51:48,670 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=6.39 vs. limit=22.5 2023-06-27 21:52:07,396 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=1904616.0, ans=22.5 2023-06-27 21:52:20,327 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1904616.0, ans=0.2 2023-06-27 21:52:29,349 INFO [train.py:996] (0/4) Epoch 11, batch 12500, loss[loss=0.2909, simple_loss=0.372, pruned_loss=0.1049, over 21712.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.3038, pruned_loss=0.06954, over 4280866.11 frames. 
], batch size: 441, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 21:52:41,910 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1904676.0, ans=0.125 2023-06-27 21:53:19,225 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1904796.0, ans=0.125 2023-06-27 21:53:24,492 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1904796.0, ans=0.125 2023-06-27 21:54:10,058 INFO [train.py:996] (0/4) Epoch 11, batch 12550, loss[loss=0.1744, simple_loss=0.2866, pruned_loss=0.03112, over 20736.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.3058, pruned_loss=0.07069, over 4277291.75 frames. ], batch size: 608, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 21:54:11,834 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.997e+02 7.151e+02 9.738e+02 1.410e+03 2.995e+03, threshold=1.948e+03, percent-clipped=12.0 2023-06-27 21:55:03,498 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.23 vs. limit=15.0 2023-06-27 21:55:19,706 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1905156.0, ans=0.025 2023-06-27 21:55:22,797 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1905156.0, ans=0.07 2023-06-27 21:55:26,386 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1905156.0, ans=0.1 2023-06-27 21:55:29,584 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1905156.0, ans=0.125 2023-06-27 21:55:29,678 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1905156.0, ans=0.05 2023-06-27 21:55:53,715 INFO [train.py:996] (0/4) Epoch 11, batch 12600, loss[loss=0.1997, simple_loss=0.2835, pruned_loss=0.05797, over 21427.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.3048, pruned_loss=0.06883, over 4281160.57 frames. ], batch size: 194, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 21:56:17,537 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1905336.0, ans=0.125 2023-06-27 21:56:38,503 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1905396.0, ans=0.04949747468305833 2023-06-27 21:57:30,666 INFO [train.py:996] (0/4) Epoch 11, batch 12650, loss[loss=0.2167, simple_loss=0.3274, pruned_loss=0.05296, over 19825.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2977, pruned_loss=0.06534, over 4283042.55 frames. 
], batch size: 702, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 21:57:36,981 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.857e+02 6.084e+02 8.925e+02 1.601e+03 4.127e+03, threshold=1.785e+03, percent-clipped=11.0 2023-06-27 21:57:52,352 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1905576.0, ans=0.0 2023-06-27 21:58:01,789 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1905636.0, ans=0.125 2023-06-27 21:58:51,788 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1905816.0, ans=0.125 2023-06-27 21:59:13,099 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1905816.0, ans=0.125 2023-06-27 21:59:17,351 INFO [train.py:996] (0/4) Epoch 11, batch 12700, loss[loss=0.2435, simple_loss=0.3138, pruned_loss=0.08665, over 21632.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2969, pruned_loss=0.06758, over 4288967.29 frames. ], batch size: 415, lr: 2.66e-03, grad_scale: 8.0 2023-06-27 21:59:21,271 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1905876.0, ans=0.125 2023-06-27 21:59:38,002 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1905936.0, ans=0.125 2023-06-27 22:00:07,926 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1905996.0, ans=0.125 2023-06-27 22:00:19,584 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1906056.0, ans=0.0 2023-06-27 22:00:59,916 INFO [train.py:996] (0/4) Epoch 11, batch 12750, loss[loss=0.2017, simple_loss=0.2921, pruned_loss=0.05565, over 21812.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2961, pruned_loss=0.06708, over 4287591.42 frames. ], batch size: 282, lr: 2.66e-03, grad_scale: 8.0 2023-06-27 22:01:03,058 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.884e+02 6.384e+02 9.703e+02 1.626e+03 3.460e+03, threshold=1.941e+03, percent-clipped=17.0 2023-06-27 22:01:15,387 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1906236.0, ans=0.2 2023-06-27 22:01:21,907 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1906236.0, ans=0.125 2023-06-27 22:01:31,673 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1906296.0, ans=0.0 2023-06-27 22:02:11,496 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1906416.0, ans=0.1 2023-06-27 22:02:37,427 INFO [train.py:996] (0/4) Epoch 11, batch 12800, loss[loss=0.2149, simple_loss=0.2938, pruned_loss=0.06802, over 21675.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2964, pruned_loss=0.06808, over 4287630.15 frames. ], batch size: 263, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:04:16,512 INFO [train.py:996] (0/4) Epoch 11, batch 12850, loss[loss=0.2377, simple_loss=0.3154, pruned_loss=0.08, over 21302.00 frames. 
], tot_loss[loss=0.2185, simple_loss=0.2986, pruned_loss=0.06924, over 4284047.22 frames. ], batch size: 143, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:04:19,911 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.496e+02 6.619e+02 8.381e+02 1.196e+03 2.769e+03, threshold=1.676e+03, percent-clipped=10.0 2023-06-27 22:05:41,984 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.13 vs. limit=15.0 2023-06-27 22:05:56,194 INFO [train.py:996] (0/4) Epoch 11, batch 12900, loss[loss=0.2927, simple_loss=0.3633, pruned_loss=0.111, over 21436.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2975, pruned_loss=0.06661, over 4284785.66 frames. ], batch size: 507, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:06:21,547 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.88 vs. limit=22.5 2023-06-27 22:06:24,547 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1907136.0, ans=0.125 2023-06-27 22:06:29,565 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1907136.0, ans=0.0 2023-06-27 22:07:00,184 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1907196.0, ans=0.125 2023-06-27 22:07:27,971 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.34 vs. limit=15.0 2023-06-27 22:07:38,234 INFO [train.py:996] (0/4) Epoch 11, batch 12950, loss[loss=0.2532, simple_loss=0.3225, pruned_loss=0.09195, over 21419.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2959, pruned_loss=0.06527, over 4279631.83 frames. ], batch size: 471, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:07:46,035 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.286e+02 5.649e+02 7.456e+02 9.840e+02 3.735e+03, threshold=1.491e+03, percent-clipped=7.0 2023-06-27 22:07:46,794 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1907376.0, ans=0.125 2023-06-27 22:09:03,346 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.65 vs. limit=15.0 2023-06-27 22:09:18,534 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1907676.0, ans=0.0 2023-06-27 22:09:19,446 INFO [train.py:996] (0/4) Epoch 11, batch 13000, loss[loss=0.1275, simple_loss=0.1928, pruned_loss=0.03108, over 21801.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2946, pruned_loss=0.06523, over 4281552.04 frames. ], batch size: 98, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:10:08,859 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=1907796.0, ans=0.05 2023-06-27 22:10:20,695 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.17 vs. 
limit=15.0 2023-06-27 22:10:21,633 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1907796.0, ans=0.125 2023-06-27 22:10:24,886 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1907796.0, ans=0.1 2023-06-27 22:11:06,075 INFO [train.py:996] (0/4) Epoch 11, batch 13050, loss[loss=0.2172, simple_loss=0.2925, pruned_loss=0.07097, over 21529.00 frames. ], tot_loss[loss=0.208, simple_loss=0.2898, pruned_loss=0.06314, over 4287803.13 frames. ], batch size: 212, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:11:09,269 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.155e+02 7.836e+02 1.180e+03 1.629e+03 3.232e+03, threshold=2.361e+03, percent-clipped=34.0 2023-06-27 22:11:31,452 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 22:11:36,397 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1908036.0, ans=0.2 2023-06-27 22:12:43,794 INFO [train.py:996] (0/4) Epoch 11, batch 13100, loss[loss=0.1837, simple_loss=0.2834, pruned_loss=0.04197, over 21789.00 frames. ], tot_loss[loss=0.2085, simple_loss=0.2912, pruned_loss=0.06287, over 4294853.61 frames. ], batch size: 332, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:13:19,896 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1908336.0, ans=0.0 2023-06-27 22:13:21,582 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1908396.0, ans=0.125 2023-06-27 22:13:25,081 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1908396.0, ans=0.0 2023-06-27 22:13:30,255 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1908396.0, ans=0.2 2023-06-27 22:14:02,942 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1908516.0, ans=0.07 2023-06-27 22:14:03,403 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.26 vs. limit=12.0 2023-06-27 22:14:19,041 INFO [train.py:996] (0/4) Epoch 11, batch 13150, loss[loss=0.2062, simple_loss=0.2923, pruned_loss=0.0601, over 21493.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.295, pruned_loss=0.06556, over 4294350.02 frames. ], batch size: 389, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:14:22,233 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.348e+02 6.692e+02 9.537e+02 1.354e+03 2.505e+03, threshold=1.907e+03, percent-clipped=1.0 2023-06-27 22:14:48,596 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1908636.0, ans=0.0 2023-06-27 22:15:41,904 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1908756.0, ans=0.5 2023-06-27 22:16:12,344 INFO [train.py:996] (0/4) Epoch 11, batch 13200, loss[loss=0.2343, simple_loss=0.3124, pruned_loss=0.07808, over 21897.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2944, pruned_loss=0.06485, over 4291641.10 frames. 
], batch size: 372, lr: 2.66e-03, grad_scale: 32.0 2023-06-27 22:16:16,582 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1908876.0, ans=0.125 2023-06-27 22:16:29,872 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1908936.0, ans=0.125 2023-06-27 22:16:34,538 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1908936.0, ans=0.0 2023-06-27 22:16:34,620 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1908936.0, ans=0.125 2023-06-27 22:17:12,775 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.72 vs. limit=15.0 2023-06-27 22:17:46,294 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1909116.0, ans=0.2 2023-06-27 22:17:50,828 INFO [train.py:996] (0/4) Epoch 11, batch 13250, loss[loss=0.2434, simple_loss=0.336, pruned_loss=0.07539, over 21523.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2947, pruned_loss=0.06739, over 4294044.38 frames. ], batch size: 471, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:17:54,571 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1909176.0, ans=0.125 2023-06-27 22:17:55,791 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.059e+02 8.358e+02 1.341e+03 1.799e+03 2.954e+03, threshold=2.682e+03, percent-clipped=21.0 2023-06-27 22:18:13,664 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.48 vs. limit=15.0 2023-06-27 22:18:56,799 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1909356.0, ans=0.125 2023-06-27 22:19:34,351 INFO [train.py:996] (0/4) Epoch 11, batch 13300, loss[loss=0.2452, simple_loss=0.3342, pruned_loss=0.07805, over 21688.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2961, pruned_loss=0.06752, over 4293861.14 frames. ], batch size: 441, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:19:41,924 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1909476.0, ans=0.0 2023-06-27 22:19:59,034 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1909536.0, ans=0.1 2023-06-27 22:20:11,017 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1909536.0, ans=0.05 2023-06-27 22:20:16,779 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1909596.0, ans=0.1 2023-06-27 22:21:01,143 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1909716.0, ans=0.0 2023-06-27 22:21:18,955 INFO [train.py:996] (0/4) Epoch 11, batch 13350, loss[loss=0.2273, simple_loss=0.3197, pruned_loss=0.06747, over 21712.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.3003, pruned_loss=0.07013, over 4291499.88 frames. 
], batch size: 298, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:21:23,889 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.586e+02 8.555e+02 1.217e+03 1.843e+03 4.034e+03, threshold=2.434e+03, percent-clipped=8.0 2023-06-27 22:21:24,285 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1909776.0, ans=0.125 2023-06-27 22:21:26,761 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=6.89 vs. limit=15.0 2023-06-27 22:21:31,199 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1909776.0, ans=0.125 2023-06-27 22:22:02,343 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1909896.0, ans=0.07 2023-06-27 22:22:10,561 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1909896.0, ans=0.125 2023-06-27 22:22:12,409 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1909956.0, ans=0.95 2023-06-27 22:22:54,901 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1910016.0, ans=0.2 2023-06-27 22:23:00,836 INFO [train.py:996] (0/4) Epoch 11, batch 13400, loss[loss=0.2182, simple_loss=0.29, pruned_loss=0.07323, over 21832.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.3017, pruned_loss=0.07152, over 4295225.01 frames. ], batch size: 282, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:23:45,928 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1910196.0, ans=0.125 2023-06-27 22:24:10,198 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1910256.0, ans=0.0 2023-06-27 22:24:43,493 INFO [train.py:996] (0/4) Epoch 11, batch 13450, loss[loss=0.2064, simple_loss=0.2963, pruned_loss=0.05821, over 20743.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3039, pruned_loss=0.07339, over 4294500.38 frames. ], batch size: 607, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:24:52,951 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.204e+02 6.376e+02 8.067e+02 1.099e+03 2.577e+03, threshold=1.613e+03, percent-clipped=1.0 2023-06-27 22:25:50,447 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1910496.0, ans=0.1 2023-06-27 22:26:31,877 INFO [train.py:996] (0/4) Epoch 11, batch 13500, loss[loss=0.1659, simple_loss=0.2402, pruned_loss=0.0458, over 21437.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2953, pruned_loss=0.07036, over 4281702.31 frames. ], batch size: 211, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:26:41,995 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.49 vs. limit=10.0 2023-06-27 22:27:34,803 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.56 vs. limit=15.0 2023-06-27 22:28:07,879 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=11.59 vs. 
limit=15.0 2023-06-27 22:28:11,315 INFO [train.py:996] (0/4) Epoch 11, batch 13550, loss[loss=0.2224, simple_loss=0.3207, pruned_loss=0.06206, over 21349.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2993, pruned_loss=0.07003, over 4281370.80 frames. ], batch size: 194, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:28:16,104 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.678e+02 8.007e+02 1.277e+03 1.961e+03 4.546e+03, threshold=2.554e+03, percent-clipped=33.0 2023-06-27 22:28:57,843 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1911096.0, ans=10.0 2023-06-27 22:29:30,787 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1911156.0, ans=0.0 2023-06-27 22:29:32,305 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1911156.0, ans=0.125 2023-06-27 22:29:35,607 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1911216.0, ans=0.1 2023-06-27 22:29:53,228 INFO [train.py:996] (0/4) Epoch 11, batch 13600, loss[loss=0.1993, simple_loss=0.2772, pruned_loss=0.06067, over 21417.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.3004, pruned_loss=0.07117, over 4283172.01 frames. ], batch size: 159, lr: 2.66e-03, grad_scale: 32.0 2023-06-27 22:29:57,434 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1911276.0, ans=0.1 2023-06-27 22:30:13,640 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1911336.0, ans=0.0 2023-06-27 22:30:39,276 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1911396.0, ans=0.015 2023-06-27 22:31:20,065 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1911516.0, ans=0.0 2023-06-27 22:31:34,440 INFO [train.py:996] (0/4) Epoch 11, batch 13650, loss[loss=0.1847, simple_loss=0.2605, pruned_loss=0.05438, over 21701.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2949, pruned_loss=0.06826, over 4281668.24 frames. ], batch size: 282, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:31:45,847 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.126e+02 5.497e+02 8.318e+02 1.364e+03 3.376e+03, threshold=1.664e+03, percent-clipped=5.0 2023-06-27 22:31:51,497 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1911576.0, ans=0.125 2023-06-27 22:32:06,198 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1911636.0, ans=0.125 2023-06-27 22:33:13,812 INFO [train.py:996] (0/4) Epoch 11, batch 13700, loss[loss=0.2912, simple_loss=0.3557, pruned_loss=0.1133, over 21479.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2903, pruned_loss=0.06759, over 4271304.03 frames. 
], batch size: 508, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:33:24,105 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1911876.0, ans=0.2 2023-06-27 22:33:45,416 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.71 vs. limit=10.0 2023-06-27 22:33:54,388 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.02 vs. limit=22.5 2023-06-27 22:34:02,391 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1911996.0, ans=0.04949747468305833 2023-06-27 22:34:57,745 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.02 vs. limit=22.5 2023-06-27 22:35:01,868 INFO [train.py:996] (0/4) Epoch 11, batch 13750, loss[loss=0.2095, simple_loss=0.2872, pruned_loss=0.06585, over 21668.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2879, pruned_loss=0.06643, over 4262903.39 frames. ], batch size: 298, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:35:13,334 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.554e+02 7.211e+02 1.142e+03 1.644e+03 3.975e+03, threshold=2.283e+03, percent-clipped=24.0 2023-06-27 22:35:15,923 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1912176.0, ans=0.125 2023-06-27 22:35:55,787 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1912296.0, ans=0.1 2023-06-27 22:35:57,724 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1912296.0, ans=0.0 2023-06-27 22:36:26,890 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1912416.0, ans=0.0 2023-06-27 22:36:47,740 INFO [train.py:996] (0/4) Epoch 11, batch 13800, loss[loss=0.2072, simple_loss=0.2971, pruned_loss=0.05866, over 21460.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2933, pruned_loss=0.06605, over 4262575.47 frames. ], batch size: 194, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:37:16,445 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.24 vs. limit=22.5 2023-06-27 22:37:17,791 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=1912536.0, ans=22.5 2023-06-27 22:37:20,702 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_na.min_abs, batch_count=1912536.0, ans=0.02 2023-06-27 22:37:52,816 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.53 vs. limit=15.0 2023-06-27 22:38:31,501 INFO [train.py:996] (0/4) Epoch 11, batch 13850, loss[loss=0.2472, simple_loss=0.3333, pruned_loss=0.0805, over 21763.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2989, pruned_loss=0.06554, over 4265498.37 frames. 
], batch size: 332, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:38:38,143 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.536e+02 6.806e+02 9.243e+02 1.369e+03 3.206e+03, threshold=1.849e+03, percent-clipped=7.0 2023-06-27 22:38:53,923 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.87 vs. limit=10.0 2023-06-27 22:39:01,588 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1912836.0, ans=0.2 2023-06-27 22:39:48,198 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1912956.0, ans=0.125 2023-06-27 22:39:50,159 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.53 vs. limit=15.0 2023-06-27 22:40:01,020 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1913016.0, ans=0.125 2023-06-27 22:40:12,074 INFO [train.py:996] (0/4) Epoch 11, batch 13900, loss[loss=0.2206, simple_loss=0.293, pruned_loss=0.07415, over 21853.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.3025, pruned_loss=0.06864, over 4273260.95 frames. ], batch size: 371, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:40:15,913 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1913076.0, ans=0.1 2023-06-27 22:40:30,766 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1913076.0, ans=0.125 2023-06-27 22:40:48,804 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1913196.0, ans=0.05 2023-06-27 22:41:02,256 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=5.49 vs. limit=22.5 2023-06-27 22:41:07,242 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.53 vs. limit=22.5 2023-06-27 22:41:15,073 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1913256.0, ans=0.2 2023-06-27 22:41:49,832 INFO [train.py:996] (0/4) Epoch 11, batch 13950, loss[loss=0.202, simple_loss=0.2809, pruned_loss=0.06159, over 21522.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.3034, pruned_loss=0.06992, over 4277746.43 frames. ], batch size: 194, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:41:58,757 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.68 vs. 
limit=15.0 2023-06-27 22:42:00,987 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.608e+02 7.245e+02 1.112e+03 1.601e+03 2.924e+03, threshold=2.224e+03, percent-clipped=16.0 2023-06-27 22:42:01,511 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1913376.0, ans=0.0 2023-06-27 22:42:48,800 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1913556.0, ans=0.2 2023-06-27 22:43:29,848 INFO [train.py:996] (0/4) Epoch 11, batch 14000, loss[loss=0.1984, simple_loss=0.289, pruned_loss=0.05393, over 21709.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.3011, pruned_loss=0.06936, over 4274579.02 frames. ], batch size: 389, lr: 2.66e-03, grad_scale: 32.0 2023-06-27 22:43:43,966 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.38 vs. limit=15.0 2023-06-27 22:44:52,099 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1913856.0, ans=0.125 2023-06-27 22:45:15,894 INFO [train.py:996] (0/4) Epoch 11, batch 14050, loss[loss=0.1953, simple_loss=0.2625, pruned_loss=0.06404, over 21308.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2958, pruned_loss=0.06573, over 4284466.39 frames. ], batch size: 144, lr: 2.66e-03, grad_scale: 32.0 2023-06-27 22:45:22,437 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.282e+02 6.606e+02 1.013e+03 1.561e+03 3.162e+03, threshold=2.026e+03, percent-clipped=9.0 2023-06-27 22:45:37,278 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1914036.0, ans=0.125 2023-06-27 22:45:49,233 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.35 vs. limit=15.0 2023-06-27 22:45:56,929 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.12 vs. limit=12.0 2023-06-27 22:46:39,026 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1914216.0, ans=0.2 2023-06-27 22:46:49,390 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1914216.0, ans=0.125 2023-06-27 22:46:57,094 INFO [train.py:996] (0/4) Epoch 11, batch 14100, loss[loss=0.2499, simple_loss=0.322, pruned_loss=0.08892, over 21508.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2898, pruned_loss=0.06554, over 4281962.86 frames. 
], batch size: 131, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:47:00,734 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1914276.0, ans=0.125 2023-06-27 22:47:17,283 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1914336.0, ans=0.2 2023-06-27 22:47:46,041 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1914396.0, ans=0.07 2023-06-27 22:48:10,300 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1914456.0, ans=0.125 2023-06-27 22:48:10,354 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1914456.0, ans=0.125 2023-06-27 22:48:16,584 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1914516.0, ans=0.5 2023-06-27 22:48:31,965 INFO [train.py:996] (0/4) Epoch 11, batch 14150, loss[loss=0.2055, simple_loss=0.2919, pruned_loss=0.0596, over 21299.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.293, pruned_loss=0.06678, over 4278059.85 frames. ], batch size: 176, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:48:44,404 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.197e+02 6.392e+02 8.340e+02 1.310e+03 2.692e+03, threshold=1.668e+03, percent-clipped=6.0 2023-06-27 22:48:58,008 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1914636.0, ans=0.125 2023-06-27 22:49:06,964 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 22:49:51,756 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1914816.0, ans=0.0 2023-06-27 22:50:10,696 INFO [train.py:996] (0/4) Epoch 11, batch 14200, loss[loss=0.2088, simple_loss=0.287, pruned_loss=0.0653, over 21328.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2921, pruned_loss=0.06552, over 4280293.24 frames. ], batch size: 159, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:50:46,709 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1914996.0, ans=0.125 2023-06-27 22:51:43,000 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1915116.0, ans=0.125 2023-06-27 22:51:50,593 INFO [train.py:996] (0/4) Epoch 11, batch 14250, loss[loss=0.1786, simple_loss=0.2602, pruned_loss=0.04853, over 21706.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2865, pruned_loss=0.06486, over 4271619.20 frames. 
], batch size: 333, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:51:57,457 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1915176.0, ans=6.0 2023-06-27 22:51:59,420 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.518e+02 7.474e+02 9.961e+02 1.736e+03 2.961e+03, threshold=1.992e+03, percent-clipped=26.0 2023-06-27 22:52:07,402 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1915236.0, ans=0.125 2023-06-27 22:52:08,843 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 22:53:11,410 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.05 vs. limit=22.5 2023-06-27 22:53:19,366 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1915416.0, ans=0.2 2023-06-27 22:53:34,567 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1915476.0, ans=0.07 2023-06-27 22:53:35,415 INFO [train.py:996] (0/4) Epoch 11, batch 14300, loss[loss=0.1753, simple_loss=0.2594, pruned_loss=0.04558, over 21265.00 frames. ], tot_loss[loss=0.2085, simple_loss=0.289, pruned_loss=0.06404, over 4256891.86 frames. ], batch size: 176, lr: 2.66e-03, grad_scale: 8.0 2023-06-27 22:53:48,419 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1915476.0, ans=0.125 2023-06-27 22:53:49,999 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1915476.0, ans=0.125 2023-06-27 22:54:20,276 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1915596.0, ans=0.0 2023-06-27 22:55:05,832 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1915716.0, ans=0.1 2023-06-27 22:55:18,193 INFO [train.py:996] (0/4) Epoch 11, batch 14350, loss[loss=0.1828, simple_loss=0.2672, pruned_loss=0.04919, over 21412.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2927, pruned_loss=0.0648, over 4251448.37 frames. ], batch size: 194, lr: 2.66e-03, grad_scale: 8.0 2023-06-27 22:55:27,947 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.575e+02 7.291e+02 1.107e+03 2.245e+03 6.428e+03, threshold=2.214e+03, percent-clipped=28.0 2023-06-27 22:56:56,514 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1916016.0, ans=0.2 2023-06-27 22:56:59,064 INFO [train.py:996] (0/4) Epoch 11, batch 14400, loss[loss=0.189, simple_loss=0.2606, pruned_loss=0.05864, over 21417.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2924, pruned_loss=0.06569, over 4264432.90 frames. ], batch size: 194, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:57:09,877 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1916076.0, ans=0.0 2023-06-27 22:57:25,215 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.37 vs. 
limit=12.0 2023-06-27 22:58:40,445 INFO [train.py:996] (0/4) Epoch 11, batch 14450, loss[loss=0.1992, simple_loss=0.2663, pruned_loss=0.06602, over 21238.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.2859, pruned_loss=0.06532, over 4252474.17 frames. ], batch size: 159, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:58:41,024 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1916376.0, ans=0.05 2023-06-27 22:58:50,302 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.736e+02 6.824e+02 1.004e+03 1.771e+03 3.739e+03, threshold=2.008e+03, percent-clipped=15.0 2023-06-27 22:59:20,230 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1916496.0, ans=0.0 2023-06-27 22:59:53,935 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1916556.0, ans=0.125 2023-06-27 22:59:53,989 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1916556.0, ans=0.125 2023-06-27 23:00:09,391 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.44 vs. limit=12.0 2023-06-27 23:00:10,591 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1916616.0, ans=0.0 2023-06-27 23:00:19,656 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.23 vs. limit=10.0 2023-06-27 23:00:21,617 INFO [train.py:996] (0/4) Epoch 11, batch 14500, loss[loss=0.1927, simple_loss=0.2782, pruned_loss=0.05354, over 21351.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.2826, pruned_loss=0.06542, over 4244662.79 frames. ], batch size: 131, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 23:00:29,136 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.91 vs. limit=15.0 2023-06-27 23:00:48,322 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1916736.0, ans=0.125 2023-06-27 23:01:57,076 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.46 vs. limit=15.0 2023-06-27 23:02:04,591 INFO [train.py:996] (0/4) Epoch 11, batch 14550, loss[loss=0.2052, simple_loss=0.2847, pruned_loss=0.06288, over 21273.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2872, pruned_loss=0.06656, over 4250471.66 frames. ], batch size: 548, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 23:02:14,903 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.488e+02 6.849e+02 9.219e+02 1.443e+03 4.541e+03, threshold=1.844e+03, percent-clipped=15.0 2023-06-27 23:02:25,700 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1917036.0, ans=0.125 2023-06-27 23:03:48,451 INFO [train.py:996] (0/4) Epoch 11, batch 14600, loss[loss=0.2219, simple_loss=0.3045, pruned_loss=0.06964, over 21691.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2943, pruned_loss=0.07026, over 4259787.23 frames. 
], batch size: 351, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:03:49,091 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1917276.0, ans=0.015 2023-06-27 23:04:00,754 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1917276.0, ans=0.125 2023-06-27 23:04:30,383 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1917396.0, ans=0.2 2023-06-27 23:04:56,879 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1917456.0, ans=0.0 2023-06-27 23:05:18,505 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1917516.0, ans=0.0 2023-06-27 23:05:31,380 INFO [train.py:996] (0/4) Epoch 11, batch 14650, loss[loss=0.2127, simple_loss=0.3025, pruned_loss=0.06149, over 21800.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2968, pruned_loss=0.06987, over 4256095.35 frames. ], batch size: 351, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:05:39,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1917576.0, ans=0.125 2023-06-27 23:05:45,915 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.373e+02 8.214e+02 1.374e+03 1.981e+03 3.761e+03, threshold=2.748e+03, percent-clipped=28.0 2023-06-27 23:06:24,207 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1917696.0, ans=0.015 2023-06-27 23:06:53,108 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1917756.0, ans=0.0 2023-06-27 23:07:19,687 INFO [train.py:996] (0/4) Epoch 11, batch 14700, loss[loss=0.1823, simple_loss=0.2707, pruned_loss=0.04695, over 21370.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2917, pruned_loss=0.06578, over 4247839.31 frames. ], batch size: 194, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:07:35,416 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1917936.0, ans=0.0 2023-06-27 23:07:37,231 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1917936.0, ans=0.07 2023-06-27 23:08:03,121 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=1917996.0, ans=0.05 2023-06-27 23:08:32,296 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1918056.0, ans=0.0 2023-06-27 23:08:37,582 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1918056.0, ans=0.035 2023-06-27 23:09:04,343 INFO [train.py:996] (0/4) Epoch 11, batch 14750, loss[loss=0.2207, simple_loss=0.2997, pruned_loss=0.07084, over 21595.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2957, pruned_loss=0.06759, over 4246856.27 frames. ], batch size: 230, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:09:14,865 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.274e+02 6.652e+02 9.504e+02 1.333e+03 3.432e+03, threshold=1.901e+03, percent-clipped=1.0 2023-06-27 23:09:51,964 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.36 vs. 
limit=12.0 2023-06-27 23:10:32,367 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1918416.0, ans=0.1 2023-06-27 23:10:48,846 INFO [train.py:996] (0/4) Epoch 11, batch 14800, loss[loss=0.2011, simple_loss=0.2837, pruned_loss=0.05928, over 21694.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.3063, pruned_loss=0.07192, over 4249959.46 frames. ], batch size: 124, lr: 2.65e-03, grad_scale: 32.0 2023-06-27 23:11:08,345 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1918476.0, ans=0.125 2023-06-27 23:11:10,107 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1918476.0, ans=0.125 2023-06-27 23:11:21,479 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1918536.0, ans=0.0 2023-06-27 23:11:47,283 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1918596.0, ans=0.0 2023-06-27 23:11:52,205 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 23:12:23,347 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1918716.0, ans=0.1 2023-06-27 23:12:43,480 INFO [train.py:996] (0/4) Epoch 11, batch 14850, loss[loss=0.215, simple_loss=0.2934, pruned_loss=0.06826, over 21709.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.3014, pruned_loss=0.07166, over 4257634.87 frames. ], batch size: 282, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:13:00,843 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.760e+02 8.785e+02 1.252e+03 1.775e+03 4.444e+03, threshold=2.503e+03, percent-clipped=22.0 2023-06-27 23:13:38,288 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1918896.0, ans=0.125 2023-06-27 23:13:41,834 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1918896.0, ans=0.125 2023-06-27 23:14:29,649 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1919016.0, ans=0.0 2023-06-27 23:14:31,495 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1919076.0, ans=0.125 2023-06-27 23:14:32,369 INFO [train.py:996] (0/4) Epoch 11, batch 14900, loss[loss=0.2855, simple_loss=0.351, pruned_loss=0.11, over 21455.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.3016, pruned_loss=0.07229, over 4254675.70 frames. ], batch size: 471, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:14:55,765 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1919136.0, ans=0.2 2023-06-27 23:16:16,124 INFO [train.py:996] (0/4) Epoch 11, batch 14950, loss[loss=0.2575, simple_loss=0.3348, pruned_loss=0.09006, over 21570.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.3028, pruned_loss=0.07176, over 4263095.05 frames. 
], batch size: 509, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:16:27,766 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.435e+02 7.906e+02 1.198e+03 1.645e+03 4.202e+03, threshold=2.397e+03, percent-clipped=8.0 2023-06-27 23:17:47,318 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1919616.0, ans=0.2 2023-06-27 23:17:55,736 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1919616.0, ans=0.2 2023-06-27 23:17:58,255 INFO [train.py:996] (0/4) Epoch 11, batch 15000, loss[loss=0.2018, simple_loss=0.2823, pruned_loss=0.06063, over 21677.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.3049, pruned_loss=0.07339, over 4265679.04 frames. ], batch size: 263, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:17:58,256 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-27 23:18:18,452 INFO [train.py:1028] (0/4) Epoch 11, validation: loss=0.2534, simple_loss=0.3437, pruned_loss=0.08155, over 1796401.00 frames. 2023-06-27 23:18:18,453 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23714MB 2023-06-27 23:18:47,793 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.70 vs. limit=15.0 2023-06-27 23:19:26,828 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1919856.0, ans=0.125 2023-06-27 23:19:42,248 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1919916.0, ans=0.0 2023-06-27 23:20:02,921 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.74 vs. limit=15.0 2023-06-27 23:20:03,428 INFO [train.py:996] (0/4) Epoch 11, batch 15050, loss[loss=0.2084, simple_loss=0.2972, pruned_loss=0.05984, over 21364.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3054, pruned_loss=0.07414, over 4260428.64 frames. ], batch size: 211, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:20:05,835 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1919976.0, ans=0.0 2023-06-27 23:20:08,867 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-320000.pt 2023-06-27 23:20:11,532 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.20 vs. 
limit=12.0 2023-06-27 23:20:17,264 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.821e+02 6.928e+02 9.435e+02 1.433e+03 3.639e+03, threshold=1.887e+03, percent-clipped=3.0 2023-06-27 23:20:59,295 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1920096.0, ans=0.0 2023-06-27 23:21:01,102 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1920096.0, ans=0.125 2023-06-27 23:21:12,836 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1920156.0, ans=0.125 2023-06-27 23:21:21,664 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1920156.0, ans=0.2 2023-06-27 23:21:49,510 INFO [train.py:996] (0/4) Epoch 11, batch 15100, loss[loss=0.2533, simple_loss=0.335, pruned_loss=0.08584, over 21648.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3078, pruned_loss=0.07387, over 4267620.02 frames. ], batch size: 389, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:21:55,056 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1920276.0, ans=0.125 2023-06-27 23:23:30,440 INFO [train.py:996] (0/4) Epoch 11, batch 15150, loss[loss=0.1896, simple_loss=0.2595, pruned_loss=0.0598, over 21337.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.3038, pruned_loss=0.0739, over 4267928.82 frames. ], batch size: 131, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:23:46,525 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.704e+02 7.406e+02 1.033e+03 1.604e+03 3.709e+03, threshold=2.066e+03, percent-clipped=14.0 2023-06-27 23:24:57,050 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.50 vs. limit=15.0 2023-06-27 23:25:12,945 INFO [train.py:996] (0/4) Epoch 11, batch 15200, loss[loss=0.1939, simple_loss=0.2766, pruned_loss=0.05563, over 21817.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2954, pruned_loss=0.06992, over 4259711.23 frames. ], batch size: 372, lr: 2.65e-03, grad_scale: 32.0 2023-06-27 23:25:19,945 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1920876.0, ans=0.125 2023-06-27 23:26:50,154 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1921116.0, ans=0.125 2023-06-27 23:27:01,193 INFO [train.py:996] (0/4) Epoch 11, batch 15250, loss[loss=0.2681, simple_loss=0.4023, pruned_loss=0.06696, over 19743.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2908, pruned_loss=0.06835, over 4258174.13 frames. 
], batch size: 702, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:27:23,401 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.342e+02 7.886e+02 1.142e+03 1.659e+03 3.992e+03, threshold=2.285e+03, percent-clipped=18.0 2023-06-27 23:27:29,004 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1921236.0, ans=0.2 2023-06-27 23:27:33,829 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1921236.0, ans=0.1 2023-06-27 23:28:19,513 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1921356.0, ans=0.125 2023-06-27 23:28:42,800 INFO [train.py:996] (0/4) Epoch 11, batch 15300, loss[loss=0.2388, simple_loss=0.3151, pruned_loss=0.08118, over 21321.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2923, pruned_loss=0.07045, over 4267878.26 frames. ], batch size: 159, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:29:07,409 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1921536.0, ans=0.125 2023-06-27 23:30:19,311 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1921716.0, ans=0.2 2023-06-27 23:30:24,105 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1921776.0, ans=0.125 2023-06-27 23:30:29,667 INFO [train.py:996] (0/4) Epoch 11, batch 15350, loss[loss=0.2137, simple_loss=0.3135, pruned_loss=0.05701, over 21837.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.2977, pruned_loss=0.07265, over 4273475.45 frames. ], batch size: 316, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:30:44,747 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1921776.0, ans=0.0 2023-06-27 23:30:47,340 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.758e+02 7.708e+02 1.113e+03 1.589e+03 3.642e+03, threshold=2.225e+03, percent-clipped=6.0 2023-06-27 23:30:54,829 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 23:31:03,652 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.54 vs. limit=15.0 2023-06-27 23:31:14,424 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1921896.0, ans=0.0 2023-06-27 23:31:28,920 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1921956.0, ans=0.0 2023-06-27 23:31:41,813 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1921956.0, ans=0.125 2023-06-27 23:32:05,707 INFO [train.py:996] (0/4) Epoch 11, batch 15400, loss[loss=0.1921, simple_loss=0.2761, pruned_loss=0.05411, over 21864.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2985, pruned_loss=0.07132, over 4272575.31 frames. 
], batch size: 298, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:32:14,181 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1922076.0, ans=0.04949747468305833 2023-06-27 23:33:16,619 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1922256.0, ans=0.125 2023-06-27 23:33:26,514 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1922316.0, ans=0.125 2023-06-27 23:33:47,728 INFO [train.py:996] (0/4) Epoch 11, batch 15450, loss[loss=0.2186, simple_loss=0.3202, pruned_loss=0.05846, over 21688.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2979, pruned_loss=0.07112, over 4265294.48 frames. ], batch size: 389, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:34:10,725 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.367e+02 6.968e+02 9.606e+02 1.449e+03 2.613e+03, threshold=1.921e+03, percent-clipped=5.0 2023-06-27 23:34:13,085 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1922436.0, ans=0.0 2023-06-27 23:34:48,572 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.41 vs. limit=15.0 2023-06-27 23:35:34,378 INFO [train.py:996] (0/4) Epoch 11, batch 15500, loss[loss=0.2445, simple_loss=0.3282, pruned_loss=0.08041, over 21818.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.3012, pruned_loss=0.07102, over 4272816.20 frames. ], batch size: 124, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:35:48,171 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1922676.0, ans=0.125 2023-06-27 23:36:23,618 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.16 vs. limit=22.5 2023-06-27 23:37:21,919 INFO [train.py:996] (0/4) Epoch 11, batch 15550, loss[loss=0.1903, simple_loss=0.2707, pruned_loss=0.05493, over 21670.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2998, pruned_loss=0.06872, over 4269809.29 frames. ], batch size: 298, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:37:30,972 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1922976.0, ans=0.0 2023-06-27 23:37:34,970 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.488e+02 6.660e+02 9.717e+02 1.306e+03 2.635e+03, threshold=1.943e+03, percent-clipped=6.0 2023-06-27 23:37:50,277 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1923036.0, ans=0.125 2023-06-27 23:38:15,285 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1923156.0, ans=0.125 2023-06-27 23:38:24,413 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.93 vs. 
limit=22.5 2023-06-27 23:38:32,053 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1923156.0, ans=0.035 2023-06-27 23:38:40,400 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1923216.0, ans=0.125 2023-06-27 23:38:50,423 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1923216.0, ans=0.125 2023-06-27 23:39:03,935 INFO [train.py:996] (0/4) Epoch 11, batch 15600, loss[loss=0.233, simple_loss=0.304, pruned_loss=0.08096, over 21494.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2948, pruned_loss=0.06772, over 4271914.06 frames. ], batch size: 441, lr: 2.65e-03, grad_scale: 32.0 2023-06-27 23:40:01,638 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1923456.0, ans=0.125 2023-06-27 23:40:38,244 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.43 vs. limit=15.0 2023-06-27 23:40:45,235 INFO [train.py:996] (0/4) Epoch 11, batch 15650, loss[loss=0.23, simple_loss=0.2936, pruned_loss=0.08324, over 20100.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2933, pruned_loss=0.06731, over 4274093.71 frames. ], batch size: 703, lr: 2.65e-03, grad_scale: 32.0 2023-06-27 23:41:03,339 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.415e+02 8.795e+02 1.290e+03 1.896e+03 3.786e+03, threshold=2.580e+03, percent-clipped=24.0 2023-06-27 23:42:02,523 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.41 vs. limit=6.0 2023-06-27 23:42:23,186 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1923816.0, ans=0.125 2023-06-27 23:42:27,306 INFO [train.py:996] (0/4) Epoch 11, batch 15700, loss[loss=0.1849, simple_loss=0.2647, pruned_loss=0.05254, over 21375.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.2886, pruned_loss=0.06596, over 4275826.03 frames. ], batch size: 194, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:42:47,561 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1923936.0, ans=0.125 2023-06-27 23:42:49,667 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.79 vs. limit=22.5 2023-06-27 23:42:57,415 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1923936.0, ans=0.1 2023-06-27 23:43:06,965 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1923996.0, ans=0.0 2023-06-27 23:43:18,630 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1923996.0, ans=0.125 2023-06-27 23:43:32,487 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.15 vs. 
limit=15.0 2023-06-27 23:44:05,450 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1924116.0, ans=0.125 2023-06-27 23:44:08,200 INFO [train.py:996] (0/4) Epoch 11, batch 15750, loss[loss=0.1869, simple_loss=0.2845, pruned_loss=0.04471, over 20791.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.2837, pruned_loss=0.06507, over 4260004.65 frames. ], batch size: 608, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:44:27,434 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.257e+02 5.974e+02 8.242e+02 1.132e+03 2.648e+03, threshold=1.648e+03, percent-clipped=1.0 2023-06-27 23:44:33,499 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1924236.0, ans=10.0 2023-06-27 23:44:38,750 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.28 vs. limit=15.0 2023-06-27 23:44:41,436 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1924236.0, ans=0.0 2023-06-27 23:45:49,089 INFO [train.py:996] (0/4) Epoch 11, batch 15800, loss[loss=0.196, simple_loss=0.2691, pruned_loss=0.06146, over 21654.00 frames. ], tot_loss[loss=0.204, simple_loss=0.2791, pruned_loss=0.0645, over 4255880.25 frames. ], batch size: 247, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:46:09,329 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1924536.0, ans=0.0 2023-06-27 23:46:54,651 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1924656.0, ans=0.125 2023-06-27 23:47:27,951 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1924716.0, ans=0.0 2023-06-27 23:47:32,274 INFO [train.py:996] (0/4) Epoch 11, batch 15850, loss[loss=0.2208, simple_loss=0.2897, pruned_loss=0.07592, over 21331.00 frames. ], tot_loss[loss=0.2062, simple_loss=0.2806, pruned_loss=0.06592, over 4258128.14 frames. ], batch size: 471, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:47:52,235 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.544e+02 6.551e+02 9.403e+02 1.336e+03 2.589e+03, threshold=1.881e+03, percent-clipped=10.0 2023-06-27 23:47:54,320 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1924836.0, ans=0.0 2023-06-27 23:49:15,201 INFO [train.py:996] (0/4) Epoch 11, batch 15900, loss[loss=0.2146, simple_loss=0.2951, pruned_loss=0.06708, over 21312.00 frames. ], tot_loss[loss=0.2048, simple_loss=0.2778, pruned_loss=0.06595, over 4259258.46 frames. ], batch size: 160, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:49:17,386 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1925076.0, ans=0.2 2023-06-27 23:49:35,721 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.94 vs. 
limit=15.0 2023-06-27 23:49:52,101 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1925196.0, ans=0.035 2023-06-27 23:50:04,585 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.63 vs. limit=15.0 2023-06-27 23:50:44,674 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1925316.0, ans=0.07 2023-06-27 23:50:45,469 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.66 vs. limit=8.0 2023-06-27 23:50:57,570 INFO [train.py:996] (0/4) Epoch 11, batch 15950, loss[loss=0.2547, simple_loss=0.3359, pruned_loss=0.08675, over 21672.00 frames. ], tot_loss[loss=0.2044, simple_loss=0.2801, pruned_loss=0.06439, over 4266936.68 frames. ], batch size: 441, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:51:09,698 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1925376.0, ans=0.2 2023-06-27 23:51:17,132 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.39 vs. limit=6.0 2023-06-27 23:51:17,432 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.351e+02 7.372e+02 1.063e+03 1.688e+03 3.100e+03, threshold=2.125e+03, percent-clipped=16.0 2023-06-27 23:51:51,931 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1925556.0, ans=0.125 2023-06-27 23:52:02,949 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1925556.0, ans=0.1 2023-06-27 23:52:02,989 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 23:52:30,997 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1925616.0, ans=0.0 2023-06-27 23:52:40,050 INFO [train.py:996] (0/4) Epoch 11, batch 16000, loss[loss=0.1722, simple_loss=0.253, pruned_loss=0.04576, over 21371.00 frames. ], tot_loss[loss=0.2049, simple_loss=0.2831, pruned_loss=0.06335, over 4267290.61 frames. ], batch size: 194, lr: 2.65e-03, grad_scale: 32.0 2023-06-27 23:52:44,419 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1925676.0, ans=0.1 2023-06-27 23:53:06,065 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 23:53:33,168 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.05 vs. limit=6.0 2023-06-27 23:53:48,865 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1925856.0, ans=0.0 2023-06-27 23:54:11,597 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1925916.0, ans=0.125 2023-06-27 23:54:17,236 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.04 vs. 
limit=6.0 2023-06-27 23:54:17,258 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.92 vs. limit=22.5 2023-06-27 23:54:17,616 INFO [train.py:996] (0/4) Epoch 11, batch 16050, loss[loss=0.2262, simple_loss=0.3314, pruned_loss=0.0605, over 21627.00 frames. ], tot_loss[loss=0.2047, simple_loss=0.2861, pruned_loss=0.06166, over 4265171.34 frames. ], batch size: 263, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:54:22,937 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1925976.0, ans=0.125 2023-06-27 23:54:38,778 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1925976.0, ans=0.125 2023-06-27 23:54:43,012 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.281e+02 6.057e+02 9.389e+02 1.429e+03 3.235e+03, threshold=1.878e+03, percent-clipped=6.0 2023-06-27 23:55:12,569 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1926096.0, ans=0.125 2023-06-27 23:55:30,131 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1926156.0, ans=0.05 2023-06-27 23:55:41,304 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1926216.0, ans=0.1 2023-06-27 23:55:41,332 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1926216.0, ans=0.125 2023-06-27 23:55:52,304 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1926216.0, ans=0.125 2023-06-27 23:55:55,753 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1926276.0, ans=0.125 2023-06-27 23:55:55,842 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 23:55:56,784 INFO [train.py:996] (0/4) Epoch 11, batch 16100, loss[loss=0.2127, simple_loss=0.2932, pruned_loss=0.06607, over 21581.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.2912, pruned_loss=0.06266, over 4265091.66 frames. ], batch size: 131, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:56:47,283 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1926396.0, ans=0.125 2023-06-27 23:57:07,314 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1926456.0, ans=0.125 2023-06-27 23:57:37,661 INFO [train.py:996] (0/4) Epoch 11, batch 16150, loss[loss=0.2025, simple_loss=0.2822, pruned_loss=0.06134, over 21841.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2909, pruned_loss=0.06477, over 4275906.16 frames. 
], batch size: 298, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:57:38,279 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1926576.0, ans=0.125 2023-06-27 23:58:03,757 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.374e+02 7.210e+02 1.100e+03 1.545e+03 2.941e+03, threshold=2.200e+03, percent-clipped=14.0 2023-06-27 23:58:04,529 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1926636.0, ans=0.125 2023-06-27 23:59:01,212 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.42 vs. limit=15.0 2023-06-27 23:59:15,177 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1926816.0, ans=0.0 2023-06-27 23:59:19,448 INFO [train.py:996] (0/4) Epoch 11, batch 16200, loss[loss=0.2566, simple_loss=0.3284, pruned_loss=0.09242, over 21242.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2953, pruned_loss=0.06676, over 4283185.70 frames. ], batch size: 143, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:59:34,672 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1926876.0, ans=0.125 2023-06-28 00:00:45,443 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1927116.0, ans=0.125 2023-06-28 00:00:47,247 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1927116.0, ans=0.125 2023-06-28 00:00:52,169 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1927116.0, ans=0.125 2023-06-28 00:00:52,907 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.21 vs. limit=6.0 2023-06-28 00:01:06,266 INFO [train.py:996] (0/4) Epoch 11, batch 16250, loss[loss=0.2093, simple_loss=0.2933, pruned_loss=0.06262, over 21603.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2946, pruned_loss=0.06666, over 4284863.20 frames. ], batch size: 389, lr: 2.65e-03, grad_scale: 16.0 2023-06-28 00:01:18,388 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1927176.0, ans=0.2 2023-06-28 00:01:23,906 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.70 vs. limit=15.0 2023-06-28 00:01:27,646 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.408e+02 8.189e+02 1.172e+03 1.830e+03 4.029e+03, threshold=2.343e+03, percent-clipped=14.0 2023-06-28 00:02:24,305 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 00:02:34,139 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1927416.0, ans=0.2 2023-06-28 00:02:40,730 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1927416.0, ans=0.0 2023-06-28 00:02:53,056 INFO [train.py:996] (0/4) Epoch 11, batch 16300, loss[loss=0.2129, simple_loss=0.2981, pruned_loss=0.06382, over 21387.00 frames. 
], tot_loss[loss=0.2074, simple_loss=0.2883, pruned_loss=0.06327, over 4276593.52 frames. ], batch size: 471, lr: 2.65e-03, grad_scale: 16.0 2023-06-28 00:03:45,648 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1927596.0, ans=0.0 2023-06-28 00:04:00,796 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1927656.0, ans=0.0 2023-06-28 00:04:22,886 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1927716.0, ans=0.125 2023-06-28 00:04:37,032 INFO [train.py:996] (0/4) Epoch 11, batch 16350, loss[loss=0.2154, simple_loss=0.3003, pruned_loss=0.06528, over 21784.00 frames. ], tot_loss[loss=0.2085, simple_loss=0.2884, pruned_loss=0.06436, over 4278042.82 frames. ], batch size: 118, lr: 2.65e-03, grad_scale: 16.0 2023-06-28 00:04:49,251 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1927776.0, ans=0.04949747468305833 2023-06-28 00:04:53,543 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.626e+02 5.951e+02 8.785e+02 1.347e+03 2.273e+03, threshold=1.757e+03, percent-clipped=0.0 2023-06-28 00:04:59,733 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1927836.0, ans=0.125 2023-06-28 00:05:23,015 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1927896.0, ans=0.125 2023-06-28 00:05:40,647 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1927956.0, ans=0.0 2023-06-28 00:05:42,329 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1927956.0, ans=0.2 2023-06-28 00:06:05,936 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1928016.0, ans=0.125 2023-06-28 00:06:15,111 INFO [train.py:996] (0/4) Epoch 11, batch 16400, loss[loss=0.2126, simple_loss=0.2835, pruned_loss=0.07088, over 21891.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2923, pruned_loss=0.06668, over 4283915.59 frames. ], batch size: 371, lr: 2.65e-03, grad_scale: 32.0 2023-06-28 00:07:20,934 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.35 vs. limit=15.0 2023-06-28 00:07:22,745 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.39 vs. limit=15.0 2023-06-28 00:07:33,454 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1928316.0, ans=0.125 2023-06-28 00:07:56,949 INFO [train.py:996] (0/4) Epoch 11, batch 16450, loss[loss=0.2018, simple_loss=0.2712, pruned_loss=0.06618, over 21660.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2908, pruned_loss=0.06649, over 4290422.32 frames. 
], batch size: 230, lr: 2.65e-03, grad_scale: 16.0 2023-06-28 00:08:16,237 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.401e+02 6.665e+02 9.796e+02 1.595e+03 2.942e+03, threshold=1.959e+03, percent-clipped=15.0 2023-06-28 00:08:37,135 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=1928496.0, ans=0.5 2023-06-28 00:08:43,932 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1928496.0, ans=0.125 2023-06-28 00:08:43,948 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1928496.0, ans=0.125 2023-06-28 00:08:58,633 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1928556.0, ans=0.2 2023-06-28 00:09:41,707 INFO [train.py:996] (0/4) Epoch 11, batch 16500, loss[loss=0.1818, simple_loss=0.2531, pruned_loss=0.05526, over 21656.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2889, pruned_loss=0.06673, over 4280585.53 frames. ], batch size: 263, lr: 2.65e-03, grad_scale: 16.0 2023-06-28 00:09:52,887 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 00:10:11,207 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1928736.0, ans=0.125 2023-06-28 00:10:48,946 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.69 vs. limit=15.0 2023-06-28 00:11:26,486 INFO [train.py:996] (0/4) Epoch 11, batch 16550, loss[loss=0.2091, simple_loss=0.278, pruned_loss=0.07014, over 21313.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2864, pruned_loss=0.06459, over 4275054.47 frames. ], batch size: 159, lr: 2.65e-03, grad_scale: 16.0 2023-06-28 00:11:36,686 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1928976.0, ans=0.2 2023-06-28 00:11:42,195 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1928976.0, ans=0.0 2023-06-28 00:11:47,828 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.54 vs. limit=12.0 2023-06-28 00:11:50,016 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.361e+02 7.345e+02 1.277e+03 1.917e+03 4.181e+03, threshold=2.555e+03, percent-clipped=23.0 2023-06-28 00:12:12,558 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1929096.0, ans=0.1 2023-06-28 00:13:05,937 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1929216.0, ans=0.125 2023-06-28 00:13:15,364 INFO [train.py:996] (0/4) Epoch 11, batch 16600, loss[loss=0.2382, simple_loss=0.3291, pruned_loss=0.07372, over 21806.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2948, pruned_loss=0.06721, over 4273069.89 frames. 
], batch size: 124, lr: 2.65e-03, grad_scale: 16.0 2023-06-28 00:13:19,708 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1929276.0, ans=0.2 2023-06-28 00:14:21,088 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1929456.0, ans=10.0 2023-06-28 00:14:46,837 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1929516.0, ans=0.0 2023-06-28 00:14:48,533 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1929516.0, ans=0.125 2023-06-28 00:15:00,042 INFO [train.py:996] (0/4) Epoch 11, batch 16650, loss[loss=0.2441, simple_loss=0.3221, pruned_loss=0.08303, over 21571.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.3053, pruned_loss=0.06951, over 4273865.55 frames. ], batch size: 389, lr: 2.65e-03, grad_scale: 16.0 2023-06-28 00:15:10,664 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1929576.0, ans=0.2 2023-06-28 00:15:28,779 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.761e+02 8.064e+02 1.116e+03 1.585e+03 3.216e+03, threshold=2.231e+03, percent-clipped=5.0 2023-06-28 00:15:31,132 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1929636.0, ans=0.125 2023-06-28 00:15:39,489 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1929636.0, ans=0.0 2023-06-28 00:15:50,076 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.12 vs. limit=10.0 2023-06-28 00:16:22,305 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1929756.0, ans=0.05 2023-06-28 00:16:26,346 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=13.19 vs. limit=15.0 2023-06-28 00:16:47,442 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1929816.0, ans=0.2 2023-06-28 00:16:50,150 INFO [train.py:996] (0/4) Epoch 11, batch 16700, loss[loss=0.2351, simple_loss=0.3385, pruned_loss=0.06579, over 21266.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.3061, pruned_loss=0.07017, over 4270977.32 frames. ], batch size: 549, lr: 2.65e-03, grad_scale: 16.0 2023-06-28 00:16:50,901 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1929876.0, ans=0.125 2023-06-28 00:17:48,614 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1929996.0, ans=0.0 2023-06-28 00:18:15,169 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.22 vs. limit=15.0 2023-06-28 00:18:44,207 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1930116.0, ans=0.125 2023-06-28 00:18:47,092 INFO [train.py:996] (0/4) Epoch 11, batch 16750, loss[loss=0.2279, simple_loss=0.3188, pruned_loss=0.06855, over 21758.00 frames. 
], tot_loss[loss=0.227, simple_loss=0.3087, pruned_loss=0.07266, over 4270903.34 frames. ], batch size: 332, lr: 2.65e-03, grad_scale: 16.0 2023-06-28 00:18:49,397 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1930176.0, ans=0.0 2023-06-28 00:19:12,043 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.860e+02 6.994e+02 8.979e+02 1.342e+03 3.526e+03, threshold=1.796e+03, percent-clipped=9.0 2023-06-28 00:19:19,974 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1930236.0, ans=0.125 2023-06-28 00:19:23,446 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=1930236.0, ans=0.05 2023-06-28 00:20:11,942 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1930356.0, ans=0.2 2023-06-28 00:20:20,651 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.37 vs. limit=15.0 2023-06-28 00:20:37,353 INFO [train.py:996] (0/4) Epoch 11, batch 16800, loss[loss=0.1554, simple_loss=0.2035, pruned_loss=0.05364, over 17012.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3123, pruned_loss=0.07247, over 4260638.05 frames. ], batch size: 61, lr: 2.65e-03, grad_scale: 32.0 2023-06-28 00:21:27,399 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 00:22:18,739 INFO [train.py:996] (0/4) Epoch 11, batch 16850, loss[loss=0.2279, simple_loss=0.3139, pruned_loss=0.07091, over 21956.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.3088, pruned_loss=0.07205, over 4263429.32 frames. ], batch size: 113, lr: 2.65e-03, grad_scale: 16.0 2023-06-28 00:22:38,611 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.042e+02 8.354e+02 1.397e+03 2.191e+03 5.653e+03, threshold=2.793e+03, percent-clipped=35.0 2023-06-28 00:23:06,635 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1930896.0, ans=0.015 2023-06-28 00:23:25,596 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1930956.0, ans=0.07 2023-06-28 00:23:33,866 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1930956.0, ans=0.0 2023-06-28 00:24:00,836 INFO [train.py:996] (0/4) Epoch 11, batch 16900, loss[loss=0.1782, simple_loss=0.2703, pruned_loss=0.04309, over 21843.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.3029, pruned_loss=0.07033, over 4267893.82 frames. ], batch size: 371, lr: 2.65e-03, grad_scale: 16.0 2023-06-28 00:24:22,496 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1931136.0, ans=0.0 2023-06-28 00:25:01,001 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1931256.0, ans=0.2 2023-06-28 00:25:27,087 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1931316.0, ans=0.125 2023-06-28 00:25:41,098 INFO [train.py:996] (0/4) Epoch 11, batch 16950, loss[loss=0.1838, simple_loss=0.265, pruned_loss=0.0513, over 21812.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2966, pruned_loss=0.06953, over 4269500.79 frames. 
], batch size: 298, lr: 2.65e-03, grad_scale: 16.0 2023-06-28 00:26:00,772 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.390e+02 6.361e+02 9.262e+02 1.143e+03 1.974e+03, threshold=1.852e+03, percent-clipped=0.0 2023-06-28 00:26:06,707 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1931436.0, ans=0.05 2023-06-28 00:26:12,310 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.70 vs. limit=15.0 2023-06-28 00:26:35,740 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1931496.0, ans=0.125 2023-06-28 00:26:44,118 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1931556.0, ans=0.125 2023-06-28 00:26:46,306 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.71 vs. limit=15.0 2023-06-28 00:27:22,674 INFO [train.py:996] (0/4) Epoch 11, batch 17000, loss[loss=0.2206, simple_loss=0.2746, pruned_loss=0.08326, over 20116.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2936, pruned_loss=0.07028, over 4271858.43 frames. ], batch size: 702, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:27:40,795 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1931736.0, ans=0.0 2023-06-28 00:27:49,195 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1931736.0, ans=0.125 2023-06-28 00:29:06,154 INFO [train.py:996] (0/4) Epoch 11, batch 17050, loss[loss=0.2181, simple_loss=0.3042, pruned_loss=0.06602, over 21649.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.299, pruned_loss=0.07234, over 4280702.08 frames. ], batch size: 263, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:29:26,239 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.012e+02 8.433e+02 1.501e+03 2.176e+03 5.028e+03, threshold=3.003e+03, percent-clipped=35.0 2023-06-28 00:29:26,814 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 00:29:54,177 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1932096.0, ans=0.125 2023-06-28 00:30:18,107 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1932156.0, ans=0.0 2023-06-28 00:30:25,337 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.96 vs. limit=15.0 2023-06-28 00:30:25,433 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.96 vs. limit=22.5 2023-06-28 00:30:40,846 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1932216.0, ans=0.125 2023-06-28 00:30:46,880 INFO [train.py:996] (0/4) Epoch 11, batch 17100, loss[loss=0.2416, simple_loss=0.3463, pruned_loss=0.06843, over 20993.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.2992, pruned_loss=0.07314, over 4288925.62 frames. 
], batch size: 607, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:30:49,331 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1932276.0, ans=0.07 2023-06-28 00:30:52,804 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1932276.0, ans=0.1 2023-06-28 00:32:16,985 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1932516.0, ans=0.0 2023-06-28 00:32:22,182 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1932516.0, ans=0.125 2023-06-28 00:32:29,944 INFO [train.py:996] (0/4) Epoch 11, batch 17150, loss[loss=0.2318, simple_loss=0.2952, pruned_loss=0.08416, over 21877.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2961, pruned_loss=0.0723, over 4291290.15 frames. ], batch size: 391, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:32:35,645 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1932576.0, ans=0.125 2023-06-28 00:32:54,500 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.748e+02 5.716e+02 7.652e+02 9.791e+02 2.028e+03, threshold=1.530e+03, percent-clipped=0.0 2023-06-28 00:33:09,750 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1932636.0, ans=0.125 2023-06-28 00:33:34,716 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1932756.0, ans=0.1 2023-06-28 00:33:47,829 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1932756.0, ans=0.2 2023-06-28 00:34:16,988 INFO [train.py:996] (0/4) Epoch 11, batch 17200, loss[loss=0.2616, simple_loss=0.3315, pruned_loss=0.09585, over 21487.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2942, pruned_loss=0.07102, over 4287453.88 frames. ], batch size: 471, lr: 2.64e-03, grad_scale: 32.0 2023-06-28 00:34:41,665 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.16 vs. limit=6.0 2023-06-28 00:35:03,874 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1932996.0, ans=0.1 2023-06-28 00:35:08,942 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1932996.0, ans=0.125 2023-06-28 00:35:11,361 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.20 vs. limit=15.0 2023-06-28 00:35:29,203 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1933056.0, ans=0.0 2023-06-28 00:35:39,591 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 00:35:50,771 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1933116.0, ans=0.125 2023-06-28 00:36:00,767 INFO [train.py:996] (0/4) Epoch 11, batch 17250, loss[loss=0.261, simple_loss=0.3293, pruned_loss=0.09639, over 21389.00 frames. 
], tot_loss[loss=0.2218, simple_loss=0.2977, pruned_loss=0.07296, over 4288980.95 frames. ], batch size: 471, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:36:32,859 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.884e+02 8.366e+02 1.182e+03 1.787e+03 4.360e+03, threshold=2.365e+03, percent-clipped=31.0 2023-06-28 00:36:45,452 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1933296.0, ans=0.0 2023-06-28 00:37:40,160 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1933416.0, ans=0.125 2023-06-28 00:37:41,916 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1933416.0, ans=0.0 2023-06-28 00:37:49,419 INFO [train.py:996] (0/4) Epoch 11, batch 17300, loss[loss=0.2315, simple_loss=0.311, pruned_loss=0.076, over 21758.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.306, pruned_loss=0.07594, over 4285557.76 frames. ], batch size: 332, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:38:21,418 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1933536.0, ans=0.2 2023-06-28 00:38:37,059 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.21 vs. limit=22.5 2023-06-28 00:38:59,923 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1933656.0, ans=0.1 2023-06-28 00:39:12,439 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1933716.0, ans=0.1 2023-06-28 00:39:39,422 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1933776.0, ans=0.1 2023-06-28 00:39:40,382 INFO [train.py:996] (0/4) Epoch 11, batch 17350, loss[loss=0.2451, simple_loss=0.336, pruned_loss=0.07711, over 21475.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3074, pruned_loss=0.07555, over 4285225.79 frames. ], batch size: 471, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:40:07,462 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.114e+02 8.299e+02 1.147e+03 1.835e+03 3.555e+03, threshold=2.294e+03, percent-clipped=8.0 2023-06-28 00:40:33,616 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1933896.0, ans=0.1 2023-06-28 00:40:39,518 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.39 vs. limit=15.0 2023-06-28 00:40:42,714 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.06 vs. limit=15.0 2023-06-28 00:41:03,673 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1934016.0, ans=0.0 2023-06-28 00:41:24,677 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1934076.0, ans=10.0 2023-06-28 00:41:25,718 INFO [train.py:996] (0/4) Epoch 11, batch 17400, loss[loss=0.2124, simple_loss=0.2728, pruned_loss=0.07597, over 20134.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3036, pruned_loss=0.07256, over 4278346.94 frames. 
], batch size: 707, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:41:26,531 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1934076.0, ans=0.0 2023-06-28 00:41:28,190 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1934076.0, ans=0.2 2023-06-28 00:41:29,898 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1934076.0, ans=0.125 2023-06-28 00:41:42,795 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1934076.0, ans=0.2 2023-06-28 00:42:28,078 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1934196.0, ans=0.125 2023-06-28 00:42:49,028 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.38 vs. limit=15.0 2023-06-28 00:43:06,939 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.01 vs. limit=15.0 2023-06-28 00:43:09,481 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1934316.0, ans=0.125 2023-06-28 00:43:13,925 INFO [train.py:996] (0/4) Epoch 11, batch 17450, loss[loss=0.1717, simple_loss=0.2618, pruned_loss=0.04076, over 21573.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.3017, pruned_loss=0.07063, over 4276913.20 frames. ], batch size: 230, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:43:41,603 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.482e+02 8.576e+02 1.354e+03 2.024e+03 4.305e+03, threshold=2.708e+03, percent-clipped=16.0 2023-06-28 00:44:27,808 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1934556.0, ans=0.125 2023-06-28 00:44:27,946 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1934556.0, ans=0.0 2023-06-28 00:44:47,346 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1934616.0, ans=0.1 2023-06-28 00:44:55,322 INFO [train.py:996] (0/4) Epoch 11, batch 17500, loss[loss=0.2431, simple_loss=0.3046, pruned_loss=0.09087, over 21703.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2962, pruned_loss=0.06799, over 4279093.85 frames. ], batch size: 508, lr: 2.64e-03, grad_scale: 8.0 2023-06-28 00:45:07,617 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1934676.0, ans=0.1 2023-06-28 00:45:11,267 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.37 vs. limit=10.0 2023-06-28 00:46:35,450 INFO [train.py:996] (0/4) Epoch 11, batch 17550, loss[loss=0.221, simple_loss=0.3073, pruned_loss=0.06739, over 21355.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2964, pruned_loss=0.06707, over 4289062.95 frames. 
], batch size: 176, lr: 2.64e-03, grad_scale: 8.0 2023-06-28 00:47:02,918 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.317e+02 6.340e+02 7.775e+02 1.102e+03 1.869e+03, threshold=1.555e+03, percent-clipped=0.0 2023-06-28 00:47:16,531 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1935096.0, ans=0.0 2023-06-28 00:48:16,980 INFO [train.py:996] (0/4) Epoch 11, batch 17600, loss[loss=0.2384, simple_loss=0.3162, pruned_loss=0.08029, over 21373.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2992, pruned_loss=0.06742, over 4280822.11 frames. ], batch size: 176, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:48:17,728 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1935276.0, ans=0.07 2023-06-28 00:48:19,551 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1935276.0, ans=0.125 2023-06-28 00:48:23,020 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1935276.0, ans=0.125 2023-06-28 00:48:30,436 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.44 vs. limit=15.0 2023-06-28 00:48:41,412 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1935336.0, ans=0.125 2023-06-28 00:48:41,503 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1935336.0, ans=0.0 2023-06-28 00:49:10,119 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.24 vs. limit=15.0 2023-06-28 00:49:15,129 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.36 vs. limit=15.0 2023-06-28 00:50:01,078 INFO [train.py:996] (0/4) Epoch 11, batch 17650, loss[loss=0.2422, simple_loss=0.3235, pruned_loss=0.08044, over 21146.00 frames. ], tot_loss[loss=0.216, simple_loss=0.297, pruned_loss=0.0675, over 4266981.17 frames. ], batch size: 143, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:50:29,624 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.907e+02 7.332e+02 1.084e+03 1.896e+03 3.594e+03, threshold=2.168e+03, percent-clipped=34.0 2023-06-28 00:50:40,645 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1935636.0, ans=0.0 2023-06-28 00:51:11,025 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1935756.0, ans=0.125 2023-06-28 00:51:14,214 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1935756.0, ans=0.125 2023-06-28 00:51:49,573 INFO [train.py:996] (0/4) Epoch 11, batch 17700, loss[loss=0.2156, simple_loss=0.3067, pruned_loss=0.0623, over 20696.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2918, pruned_loss=0.06494, over 4272307.91 frames. 
], batch size: 607, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:52:02,515 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1935876.0, ans=0.0 2023-06-28 00:52:29,785 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.75 vs. limit=15.0 2023-06-28 00:52:34,663 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1935996.0, ans=0.0 2023-06-28 00:52:44,400 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1935996.0, ans=0.2 2023-06-28 00:53:07,460 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 00:53:22,082 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1936116.0, ans=0.125 2023-06-28 00:53:30,614 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1936116.0, ans=0.0 2023-06-28 00:53:33,359 INFO [train.py:996] (0/4) Epoch 11, batch 17750, loss[loss=0.2444, simple_loss=0.3244, pruned_loss=0.08219, over 21388.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2993, pruned_loss=0.06787, over 4279844.39 frames. ], batch size: 549, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:53:51,854 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1936176.0, ans=0.04949747468305833 2023-06-28 00:54:01,415 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.937e+02 7.178e+02 1.077e+03 1.520e+03 3.336e+03, threshold=2.154e+03, percent-clipped=9.0 2023-06-28 00:54:24,502 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.39 vs. limit=15.0 2023-06-28 00:54:24,521 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.72 vs. limit=10.0 2023-06-28 00:55:00,760 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1936416.0, ans=0.1 2023-06-28 00:55:22,100 INFO [train.py:996] (0/4) Epoch 11, batch 17800, loss[loss=0.1771, simple_loss=0.2498, pruned_loss=0.05219, over 21657.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2978, pruned_loss=0.0672, over 4281702.17 frames. ], batch size: 112, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:56:15,612 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.35 vs. limit=10.0 2023-06-28 00:56:16,886 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1936596.0, ans=0.125 2023-06-28 00:57:05,740 INFO [train.py:996] (0/4) Epoch 11, batch 17850, loss[loss=0.3029, simple_loss=0.3677, pruned_loss=0.1191, over 21416.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2984, pruned_loss=0.06775, over 4271045.51 frames. 
], batch size: 471, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:57:18,669 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1936776.0, ans=0.125 2023-06-28 00:57:30,103 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1936836.0, ans=0.125 2023-06-28 00:57:34,255 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.504e+02 7.242e+02 1.057e+03 1.582e+03 3.438e+03, threshold=2.115e+03, percent-clipped=9.0 2023-06-28 00:58:48,619 INFO [train.py:996] (0/4) Epoch 11, batch 17900, loss[loss=0.2237, simple_loss=0.3087, pruned_loss=0.0693, over 21266.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.3038, pruned_loss=0.06996, over 4271119.88 frames. ], batch size: 159, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:59:31,060 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1937196.0, ans=0.2 2023-06-28 00:59:45,458 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.24 vs. limit=22.5 2023-06-28 01:00:16,692 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1937316.0, ans=0.0 2023-06-28 01:00:20,360 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1937316.0, ans=0.125 2023-06-28 01:00:21,858 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1937316.0, ans=0.2 2023-06-28 01:00:36,442 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1937376.0, ans=0.0 2023-06-28 01:00:37,333 INFO [train.py:996] (0/4) Epoch 11, batch 17950, loss[loss=0.1998, simple_loss=0.2627, pruned_loss=0.0684, over 20098.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.3029, pruned_loss=0.06703, over 4269630.80 frames. ], batch size: 703, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:00:51,116 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1937376.0, ans=0.1 2023-06-28 01:01:09,590 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.263e+02 6.938e+02 9.459e+02 1.364e+03 3.127e+03, threshold=1.892e+03, percent-clipped=7.0 2023-06-28 01:01:11,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1937436.0, ans=0.125 2023-06-28 01:01:27,842 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1937496.0, ans=0.0 2023-06-28 01:01:39,479 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1937556.0, ans=0.125 2023-06-28 01:02:06,785 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.99 vs. limit=15.0 2023-06-28 01:02:21,758 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1937676.0, ans=0.0 2023-06-28 01:02:22,723 INFO [train.py:996] (0/4) Epoch 11, batch 18000, loss[loss=0.1822, simple_loss=0.2534, pruned_loss=0.05552, over 21598.00 frames. 
], tot_loss[loss=0.2122, simple_loss=0.2948, pruned_loss=0.06481, over 4271219.07 frames. ], batch size: 298, lr: 2.64e-03, grad_scale: 32.0 2023-06-28 01:02:22,724 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-28 01:02:39,146 INFO [train.py:1028] (0/4) Epoch 11, validation: loss=0.2572, simple_loss=0.3509, pruned_loss=0.08176, over 1796401.00 frames. 2023-06-28 01:02:39,147 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23714MB 2023-06-28 01:03:11,961 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1937736.0, ans=0.05 2023-06-28 01:03:34,982 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1937796.0, ans=0.0 2023-06-28 01:03:50,083 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff3.min_abs, batch_count=1937856.0, ans=0.2 2023-06-28 01:04:21,556 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1937976.0, ans=0.125 2023-06-28 01:04:22,703 INFO [train.py:996] (0/4) Epoch 11, batch 18050, loss[loss=0.2556, simple_loss=0.317, pruned_loss=0.09708, over 21333.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2899, pruned_loss=0.06396, over 4265914.39 frames. ], batch size: 471, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:04:58,012 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.600e+02 6.639e+02 9.648e+02 1.453e+03 3.276e+03, threshold=1.930e+03, percent-clipped=8.0 2023-06-28 01:05:21,900 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1938156.0, ans=0.125 2023-06-28 01:05:44,911 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1938156.0, ans=0.125 2023-06-28 01:05:46,667 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1938156.0, ans=0.125 2023-06-28 01:06:10,713 INFO [train.py:996] (0/4) Epoch 11, batch 18100, loss[loss=0.2179, simple_loss=0.2935, pruned_loss=0.07115, over 21844.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2938, pruned_loss=0.06607, over 4265274.21 frames. ], batch size: 107, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:06:11,295 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1938276.0, ans=0.125 2023-06-28 01:07:48,925 INFO [train.py:996] (0/4) Epoch 11, batch 18150, loss[loss=0.1995, simple_loss=0.2733, pruned_loss=0.06283, over 21625.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2948, pruned_loss=0.06645, over 4253146.44 frames. 
], batch size: 247, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:08:03,926 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1938576.0, ans=0.2 2023-06-28 01:08:14,059 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1938636.0, ans=0.0 2023-06-28 01:08:18,396 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.434e+02 6.385e+02 9.174e+02 1.252e+03 3.670e+03, threshold=1.835e+03, percent-clipped=3.0 2023-06-28 01:08:35,616 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.94 vs. limit=15.0 2023-06-28 01:09:11,625 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1938816.0, ans=0.125 2023-06-28 01:09:20,748 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.64 vs. limit=15.0 2023-06-28 01:09:24,180 INFO [train.py:996] (0/4) Epoch 11, batch 18200, loss[loss=0.185, simple_loss=0.2608, pruned_loss=0.05462, over 21700.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2901, pruned_loss=0.06618, over 4257624.89 frames. ], batch size: 282, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:09:50,689 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1938936.0, ans=0.1 2023-06-28 01:09:57,521 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.95 vs. limit=22.5 2023-06-28 01:10:04,877 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1938996.0, ans=0.0 2023-06-28 01:10:06,563 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1938996.0, ans=0.125 2023-06-28 01:10:24,233 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1939056.0, ans=0.125 2023-06-28 01:11:04,699 INFO [train.py:996] (0/4) Epoch 11, batch 18250, loss[loss=0.2248, simple_loss=0.2903, pruned_loss=0.07963, over 21730.00 frames. ], tot_loss[loss=0.2061, simple_loss=0.2836, pruned_loss=0.06432, over 4260646.23 frames. ], batch size: 441, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:11:37,984 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.242e+02 6.955e+02 1.102e+03 1.552e+03 2.927e+03, threshold=2.205e+03, percent-clipped=10.0 2023-06-28 01:11:55,209 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1939296.0, ans=0.125 2023-06-28 01:12:10,139 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1939356.0, ans=0.125 2023-06-28 01:12:10,174 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1939356.0, ans=0.125 2023-06-28 01:12:17,692 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.66 vs. 
limit=15.0 2023-06-28 01:12:46,286 INFO [train.py:996] (0/4) Epoch 11, batch 18300, loss[loss=0.2252, simple_loss=0.3296, pruned_loss=0.06034, over 21716.00 frames. ], tot_loss[loss=0.2058, simple_loss=0.283, pruned_loss=0.06435, over 4265739.62 frames. ], batch size: 247, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:13:20,700 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.70 vs. limit=12.0 2023-06-28 01:13:23,480 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1939596.0, ans=0.125 2023-06-28 01:13:30,296 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 01:14:18,362 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1939716.0, ans=0.0 2023-06-28 01:14:22,407 INFO [train.py:996] (0/4) Epoch 11, batch 18350, loss[loss=0.198, simple_loss=0.2598, pruned_loss=0.06808, over 21363.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.2884, pruned_loss=0.0641, over 4263389.33 frames. ], batch size: 194, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:14:31,368 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1939776.0, ans=0.1 2023-06-28 01:14:56,374 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.730e+02 6.827e+02 1.100e+03 1.659e+03 4.791e+03, threshold=2.200e+03, percent-clipped=14.0 2023-06-28 01:14:58,877 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1939836.0, ans=0.0 2023-06-28 01:15:37,845 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 01:16:05,020 INFO [train.py:996] (0/4) Epoch 11, batch 18400, loss[loss=0.1918, simple_loss=0.2707, pruned_loss=0.05648, over 21723.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.2855, pruned_loss=0.06275, over 4263074.77 frames. ], batch size: 112, lr: 2.64e-03, grad_scale: 32.0 2023-06-28 01:16:50,520 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1940196.0, ans=0.2 2023-06-28 01:16:58,860 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1940256.0, ans=0.2 2023-06-28 01:17:03,630 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1940256.0, ans=0.125 2023-06-28 01:17:37,794 INFO [train.py:996] (0/4) Epoch 11, batch 18450, loss[loss=0.1564, simple_loss=0.2511, pruned_loss=0.03086, over 21791.00 frames. ], tot_loss[loss=0.2009, simple_loss=0.2818, pruned_loss=0.06, over 4258402.33 frames. ], batch size: 352, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:17:45,465 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1940376.0, ans=0.0 2023-06-28 01:18:14,205 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.159e+02 6.017e+02 7.931e+02 1.267e+03 3.301e+03, threshold=1.586e+03, percent-clipped=3.0 2023-06-28 01:19:15,637 INFO [train.py:996] (0/4) Epoch 11, batch 18500, loss[loss=0.1986, simple_loss=0.2928, pruned_loss=0.05222, over 21649.00 frames. ], tot_loss[loss=0.1978, simple_loss=0.2782, pruned_loss=0.0587, over 4249951.11 frames. 
], batch size: 441, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:19:17,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1940676.0, ans=0.125 2023-06-28 01:19:48,985 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1940736.0, ans=0.125 2023-06-28 01:20:01,952 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1940796.0, ans=0.5 2023-06-28 01:20:03,621 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1940796.0, ans=0.125 2023-06-28 01:20:49,045 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.42 vs. limit=6.0 2023-06-28 01:20:55,565 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.48 vs. limit=22.5 2023-06-28 01:20:57,749 INFO [train.py:996] (0/4) Epoch 11, batch 18550, loss[loss=0.1778, simple_loss=0.2501, pruned_loss=0.05275, over 21707.00 frames. ], tot_loss[loss=0.1958, simple_loss=0.2758, pruned_loss=0.05793, over 4251248.25 frames. ], batch size: 124, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:20:59,832 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1940976.0, ans=0.035 2023-06-28 01:21:29,876 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1941036.0, ans=10.0 2023-06-28 01:21:34,190 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.234e+02 6.100e+02 9.556e+02 1.452e+03 3.261e+03, threshold=1.911e+03, percent-clipped=19.0 2023-06-28 01:22:02,899 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1941156.0, ans=0.0 2023-06-28 01:22:23,223 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1941216.0, ans=0.0 2023-06-28 01:22:44,356 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_na.min_abs, batch_count=1941276.0, ans=0.02 2023-06-28 01:22:45,326 INFO [train.py:996] (0/4) Epoch 11, batch 18600, loss[loss=0.2599, simple_loss=0.3426, pruned_loss=0.08856, over 21539.00 frames. ], tot_loss[loss=0.1959, simple_loss=0.2744, pruned_loss=0.05877, over 4252268.80 frames. ], batch size: 473, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:22:50,874 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1941276.0, ans=0.04949747468305833 2023-06-28 01:23:12,077 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1941336.0, ans=0.015 2023-06-28 01:23:17,412 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 01:23:39,832 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1941396.0, ans=0.0 2023-06-28 01:24:26,401 INFO [train.py:996] (0/4) Epoch 11, batch 18650, loss[loss=0.1858, simple_loss=0.255, pruned_loss=0.05833, over 21240.00 frames. ], tot_loss[loss=0.1967, simple_loss=0.2744, pruned_loss=0.05948, over 4245329.30 frames. 
], batch size: 177, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:24:40,944 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.02 vs. limit=6.0 2023-06-28 01:24:52,416 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.322e+02 7.479e+02 1.141e+03 1.737e+03 3.586e+03, threshold=2.283e+03, percent-clipped=19.0 2023-06-28 01:25:35,694 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1941756.0, ans=0.125 2023-06-28 01:25:45,554 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1941816.0, ans=0.125 2023-06-28 01:25:48,664 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1941816.0, ans=0.125 2023-06-28 01:25:57,768 INFO [train.py:996] (0/4) Epoch 11, batch 18700, loss[loss=0.1792, simple_loss=0.2466, pruned_loss=0.05587, over 21579.00 frames. ], tot_loss[loss=0.1967, simple_loss=0.2717, pruned_loss=0.06084, over 4242393.64 frames. ], batch size: 247, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:27:04,342 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.83 vs. limit=6.0 2023-06-28 01:27:32,026 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.04 vs. limit=12.0 2023-06-28 01:27:40,759 INFO [train.py:996] (0/4) Epoch 11, batch 18750, loss[loss=0.2586, simple_loss=0.3435, pruned_loss=0.08689, over 21290.00 frames. ], tot_loss[loss=0.2013, simple_loss=0.2753, pruned_loss=0.0637, over 4259013.06 frames. ], batch size: 548, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:27:49,797 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1942176.0, ans=0.125 2023-06-28 01:28:17,084 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.499e+02 6.198e+02 1.010e+03 1.418e+03 2.835e+03, threshold=2.020e+03, percent-clipped=5.0 2023-06-28 01:28:17,686 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1942236.0, ans=0.2 2023-06-28 01:28:18,359 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.58 vs. 
limit=15.0 2023-06-28 01:28:52,393 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1942356.0, ans=0.125 2023-06-28 01:29:02,154 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1942356.0, ans=0.125 2023-06-28 01:29:15,489 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1942416.0, ans=0.125 2023-06-28 01:29:20,590 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 01:29:20,631 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1942416.0, ans=10.0 2023-06-28 01:29:23,219 INFO [train.py:996] (0/4) Epoch 11, batch 18800, loss[loss=0.2065, simple_loss=0.2762, pruned_loss=0.06841, over 21217.00 frames. ], tot_loss[loss=0.2034, simple_loss=0.279, pruned_loss=0.06393, over 4246245.35 frames. ], batch size: 176, lr: 2.64e-03, grad_scale: 32.0 2023-06-28 01:30:44,716 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1942656.0, ans=0.125 2023-06-28 01:30:48,835 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.10 vs. limit=15.0 2023-06-28 01:31:04,465 INFO [train.py:996] (0/4) Epoch 11, batch 18850, loss[loss=0.1777, simple_loss=0.2744, pruned_loss=0.0405, over 21750.00 frames. ], tot_loss[loss=0.1988, simple_loss=0.2769, pruned_loss=0.06037, over 4244814.43 frames. ], batch size: 298, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:31:16,388 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1942776.0, ans=0.0 2023-06-28 01:31:31,035 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1942836.0, ans=0.0 2023-06-28 01:31:37,313 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1942836.0, ans=0.1 2023-06-28 01:31:41,973 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.145e+02 6.934e+02 1.004e+03 1.636e+03 4.618e+03, threshold=2.007e+03, percent-clipped=13.0 2023-06-28 01:31:42,543 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1942836.0, ans=0.125 2023-06-28 01:31:54,362 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1942896.0, ans=0.09899494936611666 2023-06-28 01:31:55,959 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1942896.0, ans=0.0 2023-06-28 01:32:25,466 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1942956.0, ans=0.1 2023-06-28 01:32:27,499 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.75 vs. 
limit=15.0 2023-06-28 01:32:41,914 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1943016.0, ans=0.0 2023-06-28 01:32:46,430 INFO [train.py:996] (0/4) Epoch 11, batch 18900, loss[loss=0.2025, simple_loss=0.2683, pruned_loss=0.06829, over 21763.00 frames. ], tot_loss[loss=0.1964, simple_loss=0.2729, pruned_loss=0.05992, over 4249525.77 frames. ], batch size: 102, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:33:30,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1943196.0, ans=0.0 2023-06-28 01:33:33,885 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.69 vs. limit=15.0 2023-06-28 01:33:46,883 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1943256.0, ans=0.1 2023-06-28 01:33:57,952 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1943256.0, ans=0.04949747468305833 2023-06-28 01:34:18,949 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1943316.0, ans=0.125 2023-06-28 01:34:28,563 INFO [train.py:996] (0/4) Epoch 11, batch 18950, loss[loss=0.2047, simple_loss=0.2811, pruned_loss=0.06415, over 21878.00 frames. ], tot_loss[loss=0.1976, simple_loss=0.2728, pruned_loss=0.06118, over 4260257.71 frames. ], batch size: 124, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:35:03,221 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1943436.0, ans=0.2 2023-06-28 01:35:07,394 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.266e+02 7.363e+02 1.116e+03 1.715e+03 3.795e+03, threshold=2.232e+03, percent-clipped=17.0 2023-06-28 01:35:18,101 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1943496.0, ans=0.125 2023-06-28 01:36:16,480 INFO [train.py:996] (0/4) Epoch 11, batch 19000, loss[loss=0.2312, simple_loss=0.3072, pruned_loss=0.07761, over 21499.00 frames. ], tot_loss[loss=0.2041, simple_loss=0.2823, pruned_loss=0.06293, over 4265036.40 frames. 
], batch size: 194, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:36:24,918 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1943676.0, ans=0.1 2023-06-28 01:36:44,889 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1943736.0, ans=0.2 2023-06-28 01:36:49,763 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1943736.0, ans=0.125 2023-06-28 01:36:51,494 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1943736.0, ans=0.1 2023-06-28 01:37:15,209 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1943856.0, ans=0.125 2023-06-28 01:37:40,095 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1943916.0, ans=0.125 2023-06-28 01:37:56,353 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1943916.0, ans=0.125 2023-06-28 01:37:59,371 INFO [train.py:996] (0/4) Epoch 11, batch 19050, loss[loss=0.2132, simple_loss=0.2863, pruned_loss=0.07002, over 21806.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.287, pruned_loss=0.06636, over 4271878.46 frames. ], batch size: 112, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:38:09,449 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-324000.pt 2023-06-28 01:38:30,513 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.71 vs. limit=15.0 2023-06-28 01:38:34,339 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.763e+02 7.359e+02 1.013e+03 1.496e+03 3.084e+03, threshold=2.026e+03, percent-clipped=8.0 2023-06-28 01:39:18,947 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1944156.0, ans=0.125 2023-06-28 01:39:33,900 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1944216.0, ans=0.125 2023-06-28 01:39:43,735 INFO [train.py:996] (0/4) Epoch 11, batch 19100, loss[loss=0.2131, simple_loss=0.2809, pruned_loss=0.07267, over 15711.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2866, pruned_loss=0.06764, over 4268142.63 frames. ], batch size: 61, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:40:07,851 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1944336.0, ans=0.125 2023-06-28 01:40:11,410 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1944336.0, ans=0.0 2023-06-28 01:40:30,273 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1944396.0, ans=0.125 2023-06-28 01:40:42,004 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1944396.0, ans=0.0 2023-06-28 01:41:33,391 INFO [train.py:996] (0/4) Epoch 11, batch 19150, loss[loss=0.2653, simple_loss=0.3636, pruned_loss=0.08349, over 21860.00 frames. 
], tot_loss[loss=0.214, simple_loss=0.2901, pruned_loss=0.06897, over 4271839.43 frames. ], batch size: 372, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:41:42,843 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1944576.0, ans=0.125 2023-06-28 01:42:03,411 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1944636.0, ans=0.0 2023-06-28 01:42:09,609 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.832e+02 7.660e+02 1.202e+03 2.015e+03 4.043e+03, threshold=2.404e+03, percent-clipped=23.0 2023-06-28 01:43:08,157 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1944816.0, ans=0.125 2023-06-28 01:43:19,388 INFO [train.py:996] (0/4) Epoch 11, batch 19200, loss[loss=0.2576, simple_loss=0.3551, pruned_loss=0.08002, over 21731.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2987, pruned_loss=0.06965, over 4273554.09 frames. ], batch size: 351, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:43:30,657 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1944876.0, ans=0.125 2023-06-28 01:43:32,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1944876.0, ans=0.125 2023-06-28 01:44:10,838 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=1944996.0, ans=10.0 2023-06-28 01:44:59,638 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.82 vs. limit=15.0 2023-06-28 01:45:00,763 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1945176.0, ans=0.0 2023-06-28 01:45:01,783 INFO [train.py:996] (0/4) Epoch 11, batch 19250, loss[loss=0.1736, simple_loss=0.2648, pruned_loss=0.04124, over 21665.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2989, pruned_loss=0.06549, over 4275313.78 frames. ], batch size: 230, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:45:23,702 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1945236.0, ans=0.125 2023-06-28 01:45:32,065 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1945236.0, ans=0.0 2023-06-28 01:45:32,554 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=6.40 vs. limit=22.5 2023-06-28 01:45:36,159 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.112e+02 6.434e+02 9.084e+02 1.292e+03 2.942e+03, threshold=1.817e+03, percent-clipped=2.0 2023-06-28 01:45:46,250 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1945296.0, ans=0.125 2023-06-28 01:45:50,095 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.72 vs. 
limit=22.5 2023-06-28 01:46:09,043 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=1945356.0, ans=0.5 2023-06-28 01:46:43,088 INFO [train.py:996] (0/4) Epoch 11, batch 19300, loss[loss=0.2315, simple_loss=0.3101, pruned_loss=0.07644, over 21602.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2957, pruned_loss=0.0654, over 4286714.42 frames. ], batch size: 508, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:46:48,681 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 01:47:13,749 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1945536.0, ans=0.125 2023-06-28 01:47:26,815 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1945596.0, ans=0.1 2023-06-28 01:47:35,882 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.41 vs. limit=22.5 2023-06-28 01:48:03,001 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1945656.0, ans=0.0 2023-06-28 01:48:08,300 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1945716.0, ans=0.0 2023-06-28 01:48:20,048 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1945716.0, ans=0.2 2023-06-28 01:48:20,774 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.96 vs. limit=15.0 2023-06-28 01:48:25,859 INFO [train.py:996] (0/4) Epoch 11, batch 19350, loss[loss=0.1786, simple_loss=0.2645, pruned_loss=0.04632, over 21609.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2909, pruned_loss=0.06214, over 4281163.18 frames. ], batch size: 263, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:49:06,755 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.646e+02 6.544e+02 1.045e+03 1.616e+03 2.621e+03, threshold=2.089e+03, percent-clipped=15.0 2023-06-28 01:49:36,356 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1945956.0, ans=0.95 2023-06-28 01:49:55,916 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1946016.0, ans=0.0 2023-06-28 01:50:06,768 INFO [train.py:996] (0/4) Epoch 11, batch 19400, loss[loss=0.2002, simple_loss=0.2728, pruned_loss=0.06375, over 21724.00 frames. ], tot_loss[loss=0.2044, simple_loss=0.2869, pruned_loss=0.06101, over 4277003.06 frames. ], batch size: 263, lr: 2.64e-03, grad_scale: 8.0 2023-06-28 01:51:10,448 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.13 vs. limit=12.0 2023-06-28 01:51:14,757 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 01:51:33,234 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.37 vs. 
limit=8.0 2023-06-28 01:51:47,429 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=1946376.0, ans=0.025 2023-06-28 01:51:48,635 INFO [train.py:996] (0/4) Epoch 11, batch 19450, loss[loss=0.1958, simple_loss=0.268, pruned_loss=0.06176, over 21482.00 frames. ], tot_loss[loss=0.2049, simple_loss=0.2849, pruned_loss=0.06247, over 4282319.72 frames. ], batch size: 131, lr: 2.63e-03, grad_scale: 8.0 2023-06-28 01:52:30,299 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.692e+02 7.227e+02 1.148e+03 1.482e+03 2.916e+03, threshold=2.296e+03, percent-clipped=8.0 2023-06-28 01:52:45,524 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1946496.0, ans=0.1 2023-06-28 01:53:16,100 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1946616.0, ans=0.1 2023-06-28 01:53:22,812 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1946616.0, ans=0.125 2023-06-28 01:53:25,084 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.09 vs. limit=15.0 2023-06-28 01:53:29,808 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1946616.0, ans=0.0 2023-06-28 01:53:29,904 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1946616.0, ans=0.1 2023-06-28 01:53:29,918 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1946616.0, ans=0.0 2023-06-28 01:53:31,725 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1946676.0, ans=0.0 2023-06-28 01:53:32,672 INFO [train.py:996] (0/4) Epoch 11, batch 19500, loss[loss=0.2063, simple_loss=0.2872, pruned_loss=0.06268, over 21766.00 frames. ], tot_loss[loss=0.2037, simple_loss=0.2807, pruned_loss=0.06333, over 4284466.61 frames. ], batch size: 333, lr: 2.63e-03, grad_scale: 8.0 2023-06-28 01:54:52,238 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1946856.0, ans=0.1 2023-06-28 01:55:16,431 INFO [train.py:996] (0/4) Epoch 11, batch 19550, loss[loss=0.1643, simple_loss=0.2501, pruned_loss=0.03926, over 21768.00 frames. ], tot_loss[loss=0.2023, simple_loss=0.2786, pruned_loss=0.06303, over 4276366.17 frames. ], batch size: 282, lr: 2.63e-03, grad_scale: 8.0 2023-06-28 01:55:30,317 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1946976.0, ans=0.125 2023-06-28 01:55:53,296 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.26 vs. 
limit=15.0 2023-06-28 01:55:55,901 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1947036.0, ans=0.0 2023-06-28 01:55:57,082 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.189e+02 6.305e+02 9.070e+02 1.284e+03 2.823e+03, threshold=1.814e+03, percent-clipped=4.0 2023-06-28 01:56:15,423 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1947096.0, ans=0.125 2023-06-28 01:56:37,105 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1947156.0, ans=0.125 2023-06-28 01:56:57,998 INFO [train.py:996] (0/4) Epoch 11, batch 19600, loss[loss=0.2469, simple_loss=0.3133, pruned_loss=0.09026, over 21441.00 frames. ], tot_loss[loss=0.2035, simple_loss=0.2804, pruned_loss=0.06334, over 4285444.99 frames. ], batch size: 144, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 01:57:13,534 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1947276.0, ans=0.0 2023-06-28 01:57:42,059 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.42 vs. limit=15.0 2023-06-28 01:58:05,531 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 01:58:43,058 INFO [train.py:996] (0/4) Epoch 11, batch 19650, loss[loss=0.2066, simple_loss=0.285, pruned_loss=0.06415, over 21829.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.2855, pruned_loss=0.06717, over 4286490.80 frames. ], batch size: 247, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 01:59:29,704 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.082e+02 7.409e+02 1.104e+03 1.587e+03 3.520e+03, threshold=2.207e+03, percent-clipped=14.0 2023-06-28 02:00:15,000 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1947816.0, ans=0.125 2023-06-28 02:00:39,364 INFO [train.py:996] (0/4) Epoch 11, batch 19700, loss[loss=0.1898, simple_loss=0.3018, pruned_loss=0.03892, over 20740.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2889, pruned_loss=0.06678, over 4280883.51 frames. ], batch size: 608, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:01:38,419 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.83 vs. limit=10.0 2023-06-28 02:01:43,878 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.82 vs. limit=15.0 2023-06-28 02:02:28,063 INFO [train.py:996] (0/4) Epoch 11, batch 19750, loss[loss=0.2106, simple_loss=0.3074, pruned_loss=0.0569, over 21431.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2973, pruned_loss=0.06786, over 4273396.79 frames. ], batch size: 211, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:02:29,339 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.11 vs. limit=22.5 2023-06-28 02:02:42,757 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.33 vs. 
limit=15.0 2023-06-28 02:02:45,704 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 02:03:04,771 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.869e+02 8.060e+02 1.121e+03 1.722e+03 5.088e+03, threshold=2.243e+03, percent-clipped=14.0 2023-06-28 02:04:04,622 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1948416.0, ans=0.125 2023-06-28 02:04:10,943 INFO [train.py:996] (0/4) Epoch 11, batch 19800, loss[loss=0.1782, simple_loss=0.2513, pruned_loss=0.05253, over 21527.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2965, pruned_loss=0.0683, over 4278880.69 frames. ], batch size: 212, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:04:55,048 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1948596.0, ans=0.2 2023-06-28 02:04:57,243 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.26 vs. limit=22.5 2023-06-28 02:05:49,601 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1948716.0, ans=0.0 2023-06-28 02:06:00,833 INFO [train.py:996] (0/4) Epoch 11, batch 19850, loss[loss=0.1722, simple_loss=0.2657, pruned_loss=0.03938, over 21746.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.289, pruned_loss=0.06417, over 4271758.97 frames. ], batch size: 351, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:06:13,347 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1948776.0, ans=0.0 2023-06-28 02:06:14,946 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1948776.0, ans=0.0 2023-06-28 02:06:32,735 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.738e+02 8.105e+02 1.255e+03 1.783e+03 2.882e+03, threshold=2.510e+03, percent-clipped=10.0 2023-06-28 02:06:35,845 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.74 vs. limit=15.0 2023-06-28 02:06:52,868 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1948896.0, ans=0.125 2023-06-28 02:07:42,446 INFO [train.py:996] (0/4) Epoch 11, batch 19900, loss[loss=0.1785, simple_loss=0.2599, pruned_loss=0.04857, over 21568.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.2895, pruned_loss=0.06194, over 4274551.82 frames. ], batch size: 263, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:09:25,681 INFO [train.py:996] (0/4) Epoch 11, batch 19950, loss[loss=0.1716, simple_loss=0.246, pruned_loss=0.04864, over 21552.00 frames. ], tot_loss[loss=0.203, simple_loss=0.2828, pruned_loss=0.06165, over 4273301.23 frames. ], batch size: 230, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:09:26,273 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1949376.0, ans=0.125 2023-06-28 02:09:28,576 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.40 vs. 
limit=15.0 2023-06-28 02:09:39,976 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1949376.0, ans=10.0 2023-06-28 02:09:58,011 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.624e+02 6.449e+02 8.969e+02 1.295e+03 2.845e+03, threshold=1.794e+03, percent-clipped=2.0 2023-06-28 02:10:57,007 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1949616.0, ans=0.0 2023-06-28 02:11:07,755 INFO [train.py:996] (0/4) Epoch 11, batch 20000, loss[loss=0.2148, simple_loss=0.2884, pruned_loss=0.07061, over 21860.00 frames. ], tot_loss[loss=0.2057, simple_loss=0.2859, pruned_loss=0.06273, over 4276927.79 frames. ], batch size: 351, lr: 2.63e-03, grad_scale: 32.0 2023-06-28 02:11:12,045 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1949676.0, ans=0.125 2023-06-28 02:11:36,130 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.36 vs. limit=22.5 2023-06-28 02:12:49,281 INFO [train.py:996] (0/4) Epoch 11, batch 20050, loss[loss=0.2387, simple_loss=0.3082, pruned_loss=0.08462, over 21743.00 frames. ], tot_loss[loss=0.208, simple_loss=0.2869, pruned_loss=0.06457, over 4275089.32 frames. ], batch size: 389, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:13:09,711 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1950036.0, ans=0.125 2023-06-28 02:13:27,885 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.048e+02 6.719e+02 1.022e+03 1.464e+03 2.848e+03, threshold=2.043e+03, percent-clipped=12.0 2023-06-28 02:13:38,383 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1950096.0, ans=0.0 2023-06-28 02:14:30,378 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1950216.0, ans=0.125 2023-06-28 02:14:33,043 INFO [train.py:996] (0/4) Epoch 11, batch 20100, loss[loss=0.2431, simple_loss=0.3428, pruned_loss=0.07167, over 21858.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2892, pruned_loss=0.06624, over 4279294.81 frames. ], batch size: 371, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:14:42,334 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1950276.0, ans=0.125 2023-06-28 02:15:31,479 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1950396.0, ans=0.125 2023-06-28 02:15:38,228 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1950456.0, ans=0.125 2023-06-28 02:15:56,667 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1950456.0, ans=0.125 2023-06-28 02:16:16,920 INFO [train.py:996] (0/4) Epoch 11, batch 20150, loss[loss=0.2986, simple_loss=0.3559, pruned_loss=0.1207, over 21338.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2976, pruned_loss=0.06831, over 4276682.76 frames. 
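The many "ScheduledFloat: name=..., batch_count=..., ans=..." records log hyperparameters (dropout probabilities, skip rates, balancer probabilities) whose current value "ans" is a function of the global batch count. A rough, self-contained sketch of such a piecewise-linear schedule is shown below, only to make the (batch_count, ans) pairs concrete; the class shape and the breakpoint values are assumptions for illustration, not the scaling.py source.

from bisect import bisect_right

class ScheduledValue:
    def __init__(self, *points):                  # breakpoints, e.g. (0.0, 0.3), (20000.0, 0.1)
        self.points = sorted(points)

    def value(self, batch_count: float) -> float:
        xs = [p[0] for p in self.points]
        i = bisect_right(xs, batch_count)
        if i == 0:
            return self.points[0][1]              # before the first breakpoint
        if i == len(self.points):
            return self.points[-1][1]             # after the last breakpoint
        (x0, y0), (x1, y1) = self.points[i - 1], self.points[i]
        return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

# Hypothetical breakpoints; a schedule like this would report ans=0.1 late in training:
dropout_p = ScheduledValue((0.0, 0.3), (20000.0, 0.1))
print(f"batch_count=1950456.0, ans={dropout_p.value(1950456.0)}")   # -> 0.1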
], batch size: 507, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:16:25,833 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1950576.0, ans=0.125 2023-06-28 02:17:06,276 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.168e+02 7.352e+02 1.035e+03 1.689e+03 3.687e+03, threshold=2.071e+03, percent-clipped=15.0 2023-06-28 02:17:32,813 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1950756.0, ans=0.0 2023-06-28 02:18:07,653 INFO [train.py:996] (0/4) Epoch 11, batch 20200, loss[loss=0.2564, simple_loss=0.3782, pruned_loss=0.06727, over 19936.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.3037, pruned_loss=0.07165, over 4276002.27 frames. ], batch size: 702, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:19:09,496 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1950996.0, ans=0.2 2023-06-28 02:19:51,071 INFO [train.py:996] (0/4) Epoch 11, batch 20250, loss[loss=0.1996, simple_loss=0.289, pruned_loss=0.05515, over 21735.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.3047, pruned_loss=0.07044, over 4277072.31 frames. ], batch size: 247, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:20:39,493 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.229e+02 6.229e+02 9.670e+02 1.265e+03 2.835e+03, threshold=1.934e+03, percent-clipped=7.0 2023-06-28 02:21:37,853 INFO [train.py:996] (0/4) Epoch 11, batch 20300, loss[loss=0.1842, simple_loss=0.2725, pruned_loss=0.04794, over 21314.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.3033, pruned_loss=0.06839, over 4271595.46 frames. ], batch size: 194, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:22:22,432 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1951596.0, ans=0.125 2023-06-28 02:22:24,172 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1951596.0, ans=0.0 2023-06-28 02:23:13,331 INFO [train.py:996] (0/4) Epoch 11, batch 20350, loss[loss=0.2458, simple_loss=0.3128, pruned_loss=0.08942, over 21267.00 frames. ], tot_loss[loss=0.219, simple_loss=0.3023, pruned_loss=0.06785, over 4270059.09 frames. ], batch size: 143, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:24:01,021 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.709e+02 6.451e+02 8.868e+02 1.412e+03 2.811e+03, threshold=1.774e+03, percent-clipped=7.0 2023-06-28 02:24:16,903 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1951956.0, ans=0.0 2023-06-28 02:24:21,761 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1951956.0, ans=0.0 2023-06-28 02:24:53,594 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1952016.0, ans=0.1 2023-06-28 02:24:56,175 INFO [train.py:996] (0/4) Epoch 11, batch 20400, loss[loss=0.252, simple_loss=0.3373, pruned_loss=0.08334, over 21420.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.3049, pruned_loss=0.07061, over 4271006.36 frames. 
], batch size: 548, lr: 2.63e-03, grad_scale: 32.0 2023-06-28 02:25:14,077 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 02:25:50,239 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1952196.0, ans=0.125 2023-06-28 02:26:24,518 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1952316.0, ans=0.125 2023-06-28 02:26:37,002 INFO [train.py:996] (0/4) Epoch 11, batch 20450, loss[loss=0.1946, simple_loss=0.2774, pruned_loss=0.05592, over 16419.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.306, pruned_loss=0.07321, over 4259193.89 frames. ], batch size: 62, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:26:49,551 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.00 vs. limit=15.0 2023-06-28 02:27:22,562 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1952496.0, ans=0.0 2023-06-28 02:27:25,095 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.694e+02 8.138e+02 1.140e+03 1.534e+03 2.680e+03, threshold=2.280e+03, percent-clipped=12.0 2023-06-28 02:28:03,677 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1952616.0, ans=0.0 2023-06-28 02:28:17,764 INFO [train.py:996] (0/4) Epoch 11, batch 20500, loss[loss=0.1949, simple_loss=0.2683, pruned_loss=0.0607, over 21792.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.3003, pruned_loss=0.07255, over 4255613.09 frames. ], batch size: 124, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:28:18,185 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1952676.0, ans=0.125 2023-06-28 02:29:01,327 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=1952736.0, ans=10.0 2023-06-28 02:29:01,421 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1952736.0, ans=0.125 2023-06-28 02:29:16,629 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1952796.0, ans=0.0 2023-06-28 02:29:51,925 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1952916.0, ans=0.2 2023-06-28 02:30:04,137 INFO [train.py:996] (0/4) Epoch 11, batch 20550, loss[loss=0.1775, simple_loss=0.2438, pruned_loss=0.05562, over 21159.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2929, pruned_loss=0.07092, over 4252962.10 frames. 
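The grad_scale field in the loss records (32.0, 16.0, 8.0, ...) is the loss-scaling factor used by fp16 mixed-precision training: it is halved when a batch overflows in half precision and grows back while training is stable. A minimal sketch of a training step that would produce such values, using the standard torch.cuda.amp API; the surrounding function and variable names are assumptions, not the train.py source.

import torch

scaler = torch.cuda.amp.GradScaler(enabled=True)

def training_step(model, optimizer, batch, compute_loss):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(enabled=True):
        loss = compute_loss(model, batch)        # compute_loss is an assumed helper
    scaler.scale(loss).backward()
    scaler.step(optimizer)                       # skips the update if inf/nan grads are found
    scaler.update()                              # adjusts the scale for the next batch
    return loss.detach(), scaler.get_scale()     # get_scale() gives the logged grad_scale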
], batch size: 176, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:30:24,966 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1953036.0, ans=0.0 2023-06-28 02:30:49,268 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.859e+02 7.218e+02 1.038e+03 1.367e+03 4.804e+03, threshold=2.077e+03, percent-clipped=4.0 2023-06-28 02:31:01,423 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1953096.0, ans=0.1 2023-06-28 02:31:06,453 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1953156.0, ans=0.1 2023-06-28 02:31:08,104 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1953156.0, ans=0.125 2023-06-28 02:31:35,169 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1953216.0, ans=0.1 2023-06-28 02:31:36,797 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1953216.0, ans=0.0 2023-06-28 02:31:42,648 INFO [train.py:996] (0/4) Epoch 11, batch 20600, loss[loss=0.2344, simple_loss=0.3084, pruned_loss=0.08016, over 22085.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2958, pruned_loss=0.06946, over 4240348.45 frames. ], batch size: 119, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:31:43,863 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.58 vs. limit=15.0 2023-06-28 02:31:57,006 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1953276.0, ans=0.04949747468305833 2023-06-28 02:32:02,865 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.26 vs. limit=15.0 2023-06-28 02:32:22,427 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.95 vs. limit=15.0 2023-06-28 02:33:09,967 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.11 vs. limit=15.0 2023-06-28 02:33:10,206 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.52 vs. limit=22.5 2023-06-28 02:33:28,456 INFO [train.py:996] (0/4) Epoch 11, batch 20650, loss[loss=0.1834, simple_loss=0.25, pruned_loss=0.05834, over 21454.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2923, pruned_loss=0.06991, over 4252831.01 frames. ], batch size: 195, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:34:07,850 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.56 vs. 
limit=15.0 2023-06-28 02:34:13,041 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.450e+02 6.069e+02 8.420e+02 1.112e+03 2.688e+03, threshold=1.684e+03, percent-clipped=4.0 2023-06-28 02:34:42,304 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1953756.0, ans=0.1 2023-06-28 02:34:49,078 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1953816.0, ans=0.0 2023-06-28 02:35:11,587 INFO [train.py:996] (0/4) Epoch 11, batch 20700, loss[loss=0.227, simple_loss=0.3247, pruned_loss=0.06464, over 21229.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2853, pruned_loss=0.06665, over 4243404.19 frames. ], batch size: 548, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:35:27,770 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.35 vs. limit=15.0 2023-06-28 02:36:04,887 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1953996.0, ans=0.125 2023-06-28 02:36:16,372 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1954056.0, ans=0.125 2023-06-28 02:36:33,802 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1954116.0, ans=0.125 2023-06-28 02:36:33,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1954116.0, ans=0.1 2023-06-28 02:37:00,154 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1954176.0, ans=0.2 2023-06-28 02:37:05,867 INFO [train.py:996] (0/4) Epoch 11, batch 20750, loss[loss=0.2425, simple_loss=0.363, pruned_loss=0.06098, over 20800.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2872, pruned_loss=0.06535, over 4244411.24 frames. ], batch size: 607, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:37:42,623 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1954236.0, ans=0.125 2023-06-28 02:37:45,750 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1954296.0, ans=0.125 2023-06-28 02:37:46,762 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.132e+02 6.969e+02 1.049e+03 1.420e+03 3.386e+03, threshold=2.099e+03, percent-clipped=18.0 2023-06-28 02:38:31,294 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1954416.0, ans=0.125 2023-06-28 02:38:48,458 INFO [train.py:996] (0/4) Epoch 11, batch 20800, loss[loss=0.1875, simple_loss=0.2553, pruned_loss=0.05988, over 21830.00 frames. ], tot_loss[loss=0.212, simple_loss=0.2916, pruned_loss=0.06618, over 4243602.21 frames. ], batch size: 118, lr: 2.63e-03, grad_scale: 32.0 2023-06-28 02:38:49,693 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.93 vs. limit=15.0 2023-06-28 02:38:49,729 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.84 vs. 
limit=15.0 2023-06-28 02:39:04,126 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1954536.0, ans=0.125 2023-06-28 02:39:11,281 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.84 vs. limit=22.5 2023-06-28 02:39:13,822 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1954536.0, ans=0.0 2023-06-28 02:39:21,825 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1954536.0, ans=0.0 2023-06-28 02:39:23,508 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1954536.0, ans=0.2 2023-06-28 02:40:07,092 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1954716.0, ans=0.125 2023-06-28 02:40:27,993 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.96 vs. limit=15.0 2023-06-28 02:40:30,173 INFO [train.py:996] (0/4) Epoch 11, batch 20850, loss[loss=0.2239, simple_loss=0.3003, pruned_loss=0.07373, over 22004.00 frames. ], tot_loss[loss=0.2058, simple_loss=0.2835, pruned_loss=0.06405, over 4245696.44 frames. ], batch size: 113, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:41:11,759 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.093e+02 6.813e+02 9.986e+02 1.626e+03 4.926e+03, threshold=1.997e+03, percent-clipped=17.0 2023-06-28 02:41:19,347 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=1954896.0, ans=0.5 2023-06-28 02:41:25,991 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1954956.0, ans=0.125 2023-06-28 02:42:12,913 INFO [train.py:996] (0/4) Epoch 11, batch 20900, loss[loss=0.225, simple_loss=0.3046, pruned_loss=0.07271, over 21781.00 frames. ], tot_loss[loss=0.207, simple_loss=0.284, pruned_loss=0.06498, over 4259484.13 frames. ], batch size: 391, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:42:19,825 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1955076.0, ans=0.125 2023-06-28 02:42:21,600 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1955076.0, ans=0.125 2023-06-28 02:42:55,889 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 02:42:57,386 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1955196.0, ans=0.125 2023-06-28 02:43:20,038 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1955256.0, ans=10.0 2023-06-28 02:43:46,920 INFO [train.py:996] (0/4) Epoch 11, batch 20950, loss[loss=0.1768, simple_loss=0.26, pruned_loss=0.04676, over 21503.00 frames. ], tot_loss[loss=0.2032, simple_loss=0.2813, pruned_loss=0.06256, over 4266402.44 frames. 
], batch size: 212, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:43:48,124 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.27 vs. limit=15.0 2023-06-28 02:43:58,372 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=1955376.0, ans=0.05 2023-06-28 02:44:06,994 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.74 vs. limit=22.5 2023-06-28 02:44:14,823 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1955436.0, ans=0.0 2023-06-28 02:44:24,718 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=5.84 vs. limit=22.5 2023-06-28 02:44:26,754 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.631e+02 6.600e+02 1.009e+03 1.481e+03 3.746e+03, threshold=2.018e+03, percent-clipped=8.0 2023-06-28 02:44:30,463 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1955496.0, ans=0.0 2023-06-28 02:45:13,203 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.47 vs. limit=15.0 2023-06-28 02:45:23,298 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1955616.0, ans=0.125 2023-06-28 02:45:25,843 INFO [train.py:996] (0/4) Epoch 11, batch 21000, loss[loss=0.2079, simple_loss=0.2847, pruned_loss=0.06554, over 21933.00 frames. ], tot_loss[loss=0.2026, simple_loss=0.2798, pruned_loss=0.06273, over 4274528.87 frames. ], batch size: 316, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:45:25,844 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-28 02:45:45,777 INFO [train.py:1028] (0/4) Epoch 11, validation: loss=0.2661, simple_loss=0.3574, pruned_loss=0.08743, over 1796401.00 frames. 2023-06-28 02:45:45,778 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23714MB 2023-06-28 02:45:59,513 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1955676.0, ans=0.0 2023-06-28 02:46:01,743 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.18 vs. limit=10.0 2023-06-28 02:47:13,492 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=1955916.0, ans=10.0 2023-06-28 02:47:13,500 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1955916.0, ans=0.035 2023-06-28 02:47:20,045 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1955916.0, ans=0.0 2023-06-28 02:47:22,978 INFO [train.py:996] (0/4) Epoch 11, batch 21050, loss[loss=0.211, simple_loss=0.2966, pruned_loss=0.0627, over 16118.00 frames. ], tot_loss[loss=0.2023, simple_loss=0.2788, pruned_loss=0.06284, over 4272732.86 frames. 
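The "Computing validation loss" / "Epoch 11, validation: loss=..." / "Maximum memory allocated so far is 23714MB" records above come from the periodic validation pass run every valid_interval batches. A sketch of such a pass is given below; the helper names and the frame-weighted averaging are assumptions for illustration, not the train.py source.

import logging
import torch

def run_validation(model, valid_dl, compute_loss, device):
    logging.info("Computing validation loss")
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    with torch.no_grad():
        for batch in valid_dl:
            loss, num_frames = compute_loss(model, batch)   # assumed to return (tensor, float)
            tot_loss += loss.item()
            tot_frames += num_frames
    model.train()
    logging.info(f"validation: loss={tot_loss / tot_frames:.4g}, over {tot_frames:.2f} frames.")
    mem_mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
    logging.info(f"Maximum memory allocated so far is {mem_mb}MB")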
], batch size: 64, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:47:38,869 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=1955976.0, ans=15.0 2023-06-28 02:47:42,371 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.27 vs. limit=15.0 2023-06-28 02:48:06,583 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1956096.0, ans=0.0 2023-06-28 02:48:09,001 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.407e+02 6.035e+02 7.930e+02 1.297e+03 2.545e+03, threshold=1.586e+03, percent-clipped=7.0 2023-06-28 02:48:13,048 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1956096.0, ans=0.125 2023-06-28 02:48:33,350 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.71 vs. limit=6.0 2023-06-28 02:48:34,176 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1956156.0, ans=0.2 2023-06-28 02:48:39,167 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1956216.0, ans=0.125 2023-06-28 02:48:54,248 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.88 vs. limit=15.0 2023-06-28 02:49:00,425 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1956216.0, ans=0.125 2023-06-28 02:49:04,840 INFO [train.py:996] (0/4) Epoch 11, batch 21100, loss[loss=0.1806, simple_loss=0.2471, pruned_loss=0.057, over 21674.00 frames. ], tot_loss[loss=0.2008, simple_loss=0.2755, pruned_loss=0.06298, over 4269114.28 frames. ], batch size: 248, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:50:40,602 INFO [train.py:996] (0/4) Epoch 11, batch 21150, loss[loss=0.1766, simple_loss=0.2412, pruned_loss=0.05602, over 21595.00 frames. ], tot_loss[loss=0.1987, simple_loss=0.2717, pruned_loss=0.06289, over 4264101.34 frames. ], batch size: 263, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:50:43,075 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1956576.0, ans=0.1 2023-06-28 02:50:44,730 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1956576.0, ans=0.2 2023-06-28 02:51:26,154 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.779e+02 6.313e+02 9.139e+02 1.246e+03 3.367e+03, threshold=1.828e+03, percent-clipped=14.0 2023-06-28 02:51:57,309 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1956816.0, ans=0.125 2023-06-28 02:51:57,910 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.57 vs. limit=12.0 2023-06-28 02:52:09,409 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.24 vs. 
limit=12.0 2023-06-28 02:52:15,580 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1956876.0, ans=0.125 2023-06-28 02:52:16,462 INFO [train.py:996] (0/4) Epoch 11, batch 21200, loss[loss=0.2031, simple_loss=0.273, pruned_loss=0.06659, over 21336.00 frames. ], tot_loss[loss=0.1962, simple_loss=0.2681, pruned_loss=0.06214, over 4269326.19 frames. ], batch size: 471, lr: 2.63e-03, grad_scale: 32.0 2023-06-28 02:52:46,753 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1956936.0, ans=0.0 2023-06-28 02:53:50,828 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.34 vs. limit=22.5 2023-06-28 02:53:58,173 INFO [train.py:996] (0/4) Epoch 11, batch 21250, loss[loss=0.1975, simple_loss=0.27, pruned_loss=0.06254, over 16061.00 frames. ], tot_loss[loss=0.1943, simple_loss=0.2651, pruned_loss=0.06177, over 4258555.19 frames. ], batch size: 65, lr: 2.63e-03, grad_scale: 8.0 2023-06-28 02:54:47,831 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.642e+02 7.284e+02 1.070e+03 1.587e+03 2.954e+03, threshold=2.141e+03, percent-clipped=16.0 2023-06-28 02:54:51,795 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1957296.0, ans=0.2 2023-06-28 02:55:14,156 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.73 vs. limit=15.0 2023-06-28 02:55:14,966 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1957416.0, ans=0.09899494936611666 2023-06-28 02:55:39,443 INFO [train.py:996] (0/4) Epoch 11, batch 21300, loss[loss=0.2205, simple_loss=0.2875, pruned_loss=0.07678, over 21609.00 frames. ], tot_loss[loss=0.1998, simple_loss=0.2717, pruned_loss=0.06396, over 4267382.23 frames. ], batch size: 548, lr: 2.63e-03, grad_scale: 8.0 2023-06-28 02:56:08,515 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1957536.0, ans=0.125 2023-06-28 02:56:15,636 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.50 vs. limit=15.0 2023-06-28 02:56:41,531 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1957656.0, ans=0.0 2023-06-28 02:56:48,121 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1957656.0, ans=0.125 2023-06-28 02:57:22,802 INFO [train.py:996] (0/4) Epoch 11, batch 21350, loss[loss=0.1897, simple_loss=0.2641, pruned_loss=0.05765, over 21163.00 frames. ], tot_loss[loss=0.203, simple_loss=0.2767, pruned_loss=0.06462, over 4265284.98 frames. 
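The "Whitening: name=..., metric=X vs. limit=Y" records compare some measure of how far a layer's activations are from being "white" (covariance proportional to identity) against a whitening limit; the log entry is emitted when the metric exceeds the limit. Below is one plausible such metric, shown purely for intuition and not claimed to be the exact formula in scaling.py: it equals 1.0 for perfectly white features and grows as the covariance spectrum becomes more uneven.

import torch

def whitening_metric(x: torch.Tensor) -> float:
    """x: (num_frames, num_channels) activations for a single group (num_groups=1 case)."""
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.t() @ x) / x.shape[0]                       # (C, C) channel covariance
    num_channels = cov.shape[0]
    # num_channels * trace(cov^2) / trace(cov)^2 >= 1, with equality iff cov = c * I
    metric = (cov @ cov).diagonal().sum() * num_channels / cov.diagonal().sum() ** 2
    return metric.item()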
], batch size: 143, lr: 2.63e-03, grad_scale: 8.0 2023-06-28 02:57:23,238 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1957776.0, ans=0.2 2023-06-28 02:57:29,882 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1957776.0, ans=0.2 2023-06-28 02:57:42,112 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1957836.0, ans=0.125 2023-06-28 02:58:08,238 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.369e+02 7.362e+02 1.168e+03 1.519e+03 3.106e+03, threshold=2.337e+03, percent-clipped=14.0 2023-06-28 02:58:37,636 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1957956.0, ans=0.125 2023-06-28 02:58:42,517 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1957956.0, ans=0.2 2023-06-28 02:58:42,588 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1957956.0, ans=0.125 2023-06-28 02:58:42,606 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1957956.0, ans=0.2 2023-06-28 02:58:59,806 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1958016.0, ans=0.2 2023-06-28 02:59:01,462 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1958016.0, ans=0.0 2023-06-28 02:59:07,585 INFO [train.py:996] (0/4) Epoch 11, batch 21400, loss[loss=0.184, simple_loss=0.2432, pruned_loss=0.06242, over 20228.00 frames. ], tot_loss[loss=0.2048, simple_loss=0.2803, pruned_loss=0.0647, over 4264087.54 frames. ], batch size: 703, lr: 2.63e-03, grad_scale: 8.0 2023-06-28 02:59:14,906 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1958076.0, ans=0.125 2023-06-28 02:59:42,699 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1958136.0, ans=10.0 2023-06-28 02:59:47,899 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1958196.0, ans=0.5 2023-06-28 03:00:49,132 INFO [train.py:996] (0/4) Epoch 11, batch 21450, loss[loss=0.2122, simple_loss=0.2772, pruned_loss=0.07361, over 21423.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.2821, pruned_loss=0.06588, over 4265183.84 frames. ], batch size: 144, lr: 2.63e-03, grad_scale: 8.0 2023-06-28 03:01:33,849 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.846e+02 6.247e+02 7.898e+02 1.203e+03 2.207e+03, threshold=1.580e+03, percent-clipped=0.0 2023-06-28 03:01:45,693 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1958496.0, ans=0.0 2023-06-28 03:02:30,237 INFO [train.py:996] (0/4) Epoch 11, batch 21500, loss[loss=0.234, simple_loss=0.3113, pruned_loss=0.07838, over 21878.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.281, pruned_loss=0.06672, over 4268104.66 frames. 
], batch size: 107, lr: 2.63e-03, grad_scale: 8.0 2023-06-28 03:02:30,762 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1958676.0, ans=0.0 2023-06-28 03:02:32,511 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1958676.0, ans=0.125 2023-06-28 03:02:41,980 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1958676.0, ans=0.125 2023-06-28 03:03:02,149 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.67 vs. limit=15.0 2023-06-28 03:03:18,049 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1958796.0, ans=0.1 2023-06-28 03:04:11,236 INFO [train.py:996] (0/4) Epoch 11, batch 21550, loss[loss=0.1786, simple_loss=0.2492, pruned_loss=0.05398, over 21282.00 frames. ], tot_loss[loss=0.2028, simple_loss=0.2758, pruned_loss=0.06492, over 4255281.32 frames. ], batch size: 176, lr: 2.63e-03, grad_scale: 8.0 2023-06-28 03:04:56,025 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.241e+02 6.233e+02 9.531e+02 1.253e+03 2.671e+03, threshold=1.906e+03, percent-clipped=10.0 2023-06-28 03:05:06,431 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1959096.0, ans=0.125 2023-06-28 03:05:19,846 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1959156.0, ans=0.2 2023-06-28 03:05:49,819 INFO [train.py:996] (0/4) Epoch 11, batch 21600, loss[loss=0.185, simple_loss=0.2491, pruned_loss=0.06042, over 21602.00 frames. ], tot_loss[loss=0.1987, simple_loss=0.2713, pruned_loss=0.06303, over 4259389.53 frames. ], batch size: 415, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 03:06:45,082 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1959396.0, ans=0.2 2023-06-28 03:07:16,873 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1959516.0, ans=0.0 2023-06-28 03:07:37,235 INFO [train.py:996] (0/4) Epoch 11, batch 21650, loss[loss=0.2028, simple_loss=0.304, pruned_loss=0.05077, over 21813.00 frames. ], tot_loss[loss=0.1987, simple_loss=0.275, pruned_loss=0.06118, over 4263995.23 frames. ], batch size: 282, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 03:07:44,575 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1959576.0, ans=0.125 2023-06-28 03:08:08,191 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.64 vs. limit=15.0 2023-06-28 03:08:26,010 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.567e+02 7.101e+02 1.132e+03 1.604e+03 3.542e+03, threshold=2.263e+03, percent-clipped=14.0 2023-06-28 03:09:18,420 INFO [train.py:996] (0/4) Epoch 11, batch 21700, loss[loss=0.1826, simple_loss=0.2618, pruned_loss=0.05167, over 21797.00 frames. ], tot_loss[loss=0.1973, simple_loss=0.276, pruned_loss=0.0593, over 4255064.17 frames. 
], batch size: 124, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 03:09:29,395 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.25 vs. limit=6.0 2023-06-28 03:10:35,881 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1960056.0, ans=0.125 2023-06-28 03:11:00,008 INFO [train.py:996] (0/4) Epoch 11, batch 21750, loss[loss=0.2018, simple_loss=0.261, pruned_loss=0.07128, over 21526.00 frames. ], tot_loss[loss=0.1961, simple_loss=0.2723, pruned_loss=0.05996, over 4263196.42 frames. ], batch size: 442, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 03:11:04,151 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.98 vs. limit=15.0 2023-06-28 03:11:06,994 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1960176.0, ans=0.2 2023-06-28 03:11:43,944 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.642e+02 7.621e+02 1.214e+03 1.880e+03 3.851e+03, threshold=2.427e+03, percent-clipped=16.0 2023-06-28 03:11:45,344 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.97 vs. limit=22.5 2023-06-28 03:12:23,250 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1960416.0, ans=0.125 2023-06-28 03:12:37,219 INFO [train.py:996] (0/4) Epoch 11, batch 21800, loss[loss=0.1811, simple_loss=0.2458, pruned_loss=0.0582, over 21594.00 frames. ], tot_loss[loss=0.1949, simple_loss=0.2691, pruned_loss=0.06037, over 4255217.25 frames. ], batch size: 264, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 03:12:52,277 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.22 vs. limit=22.5 2023-06-28 03:13:30,080 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1960596.0, ans=0.0 2023-06-28 03:13:56,356 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1960716.0, ans=0.2 2023-06-28 03:14:15,423 INFO [train.py:996] (0/4) Epoch 11, batch 21850, loss[loss=0.2161, simple_loss=0.3117, pruned_loss=0.06028, over 21852.00 frames. ], tot_loss[loss=0.1995, simple_loss=0.2758, pruned_loss=0.06156, over 4245523.85 frames. ], batch size: 351, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 03:14:26,493 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.46 vs. limit=15.0 2023-06-28 03:14:46,482 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1960836.0, ans=0.125 2023-06-28 03:14:46,949 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.86 vs. 
limit=22.5 2023-06-28 03:15:00,556 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.488e+02 6.380e+02 8.991e+02 1.412e+03 2.394e+03, threshold=1.798e+03, percent-clipped=0.0 2023-06-28 03:15:12,564 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1960896.0, ans=0.125 2023-06-28 03:15:32,409 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1961016.0, ans=0.0 2023-06-28 03:15:39,316 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1961016.0, ans=0.125 2023-06-28 03:15:53,000 INFO [train.py:996] (0/4) Epoch 11, batch 21900, loss[loss=0.1724, simple_loss=0.2357, pruned_loss=0.05451, over 21657.00 frames. ], tot_loss[loss=0.2013, simple_loss=0.2767, pruned_loss=0.0629, over 4250304.64 frames. ], batch size: 231, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 03:15:55,366 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1961076.0, ans=0.1 2023-06-28 03:16:12,087 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1961136.0, ans=0.125 2023-06-28 03:16:43,374 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1961196.0, ans=0.0 2023-06-28 03:17:12,448 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1961316.0, ans=0.125 2023-06-28 03:17:26,121 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.55 vs. limit=10.0 2023-06-28 03:17:29,754 INFO [train.py:996] (0/4) Epoch 11, batch 21950, loss[loss=0.1451, simple_loss=0.213, pruned_loss=0.03855, over 16235.00 frames. ], tot_loss[loss=0.198, simple_loss=0.2717, pruned_loss=0.06213, over 4244410.20 frames. ], batch size: 64, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:17:32,410 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.95 vs. 
limit=22.5 2023-06-28 03:17:33,485 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1961376.0, ans=0.125 2023-06-28 03:17:44,970 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1961436.0, ans=0.1 2023-06-28 03:17:48,381 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1961436.0, ans=0.125 2023-06-28 03:18:05,733 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1961436.0, ans=0.0 2023-06-28 03:18:23,016 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.683e+02 5.864e+02 6.968e+02 1.003e+03 1.764e+03, threshold=1.394e+03, percent-clipped=0.0 2023-06-28 03:18:39,302 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1961556.0, ans=0.125 2023-06-28 03:18:50,956 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1961616.0, ans=0.2 2023-06-28 03:19:11,926 INFO [train.py:996] (0/4) Epoch 11, batch 22000, loss[loss=0.1606, simple_loss=0.2294, pruned_loss=0.04588, over 21237.00 frames. ], tot_loss[loss=0.1917, simple_loss=0.2657, pruned_loss=0.0589, over 4256848.48 frames. ], batch size: 159, lr: 2.62e-03, grad_scale: 32.0 2023-06-28 03:19:29,364 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1961736.0, ans=0.1 2023-06-28 03:19:29,479 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1961736.0, ans=0.0 2023-06-28 03:19:32,976 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1961736.0, ans=0.5 2023-06-28 03:19:39,571 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1961736.0, ans=0.125 2023-06-28 03:20:24,241 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1961856.0, ans=0.125 2023-06-28 03:20:55,914 INFO [train.py:996] (0/4) Epoch 11, batch 22050, loss[loss=0.166, simple_loss=0.2338, pruned_loss=0.04914, over 20758.00 frames. ], tot_loss[loss=0.1941, simple_loss=0.2687, pruned_loss=0.05978, over 4247566.85 frames. ], batch size: 608, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:20:58,153 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1961976.0, ans=0.035 2023-06-28 03:21:07,111 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.79 vs. limit=15.0 2023-06-28 03:21:08,764 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.04 vs. 
limit=15.0 2023-06-28 03:21:35,065 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1962036.0, ans=0.125 2023-06-28 03:21:53,083 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.146e+02 7.401e+02 1.317e+03 1.911e+03 4.599e+03, threshold=2.634e+03, percent-clipped=46.0 2023-06-28 03:21:57,372 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1962096.0, ans=0.09899494936611666 2023-06-28 03:22:39,372 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1962276.0, ans=0.1 2023-06-28 03:22:40,209 INFO [train.py:996] (0/4) Epoch 11, batch 22100, loss[loss=0.2578, simple_loss=0.3633, pruned_loss=0.07618, over 19876.00 frames. ], tot_loss[loss=0.2054, simple_loss=0.2814, pruned_loss=0.06472, over 4249395.91 frames. ], batch size: 702, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:23:17,275 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1962336.0, ans=0.09899494936611666 2023-06-28 03:23:37,092 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1962396.0, ans=0.125 2023-06-28 03:23:43,535 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1962456.0, ans=0.0 2023-06-28 03:24:05,031 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1962516.0, ans=0.2 2023-06-28 03:24:17,374 INFO [train.py:996] (0/4) Epoch 11, batch 22150, loss[loss=0.2446, simple_loss=0.3146, pruned_loss=0.08726, over 21771.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.2853, pruned_loss=0.06603, over 4261013.18 frames. ], batch size: 441, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:24:38,048 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1962636.0, ans=0.125 2023-06-28 03:24:55,882 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1962636.0, ans=0.0 2023-06-28 03:25:13,703 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.995e+02 8.783e+02 1.255e+03 1.849e+03 4.260e+03, threshold=2.511e+03, percent-clipped=3.0 2023-06-28 03:26:00,156 INFO [train.py:996] (0/4) Epoch 11, batch 22200, loss[loss=0.2885, simple_loss=0.3648, pruned_loss=0.1061, over 21639.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.2881, pruned_loss=0.06777, over 4270421.03 frames. ], batch size: 508, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:27:41,229 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1963176.0, ans=0.0 2023-06-28 03:27:42,225 INFO [train.py:996] (0/4) Epoch 11, batch 22250, loss[loss=0.224, simple_loss=0.3039, pruned_loss=0.07205, over 21897.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2933, pruned_loss=0.06876, over 4271583.33 frames. 
], batch size: 316, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:28:08,660 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 03:28:30,099 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1963296.0, ans=0.125 2023-06-28 03:28:37,931 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.198e+02 6.711e+02 8.468e+02 1.239e+03 3.194e+03, threshold=1.694e+03, percent-clipped=5.0 2023-06-28 03:28:38,367 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1963296.0, ans=0.125 2023-06-28 03:29:28,291 INFO [train.py:996] (0/4) Epoch 11, batch 22300, loss[loss=0.2145, simple_loss=0.281, pruned_loss=0.07406, over 21365.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2957, pruned_loss=0.07049, over 4276729.07 frames. ], batch size: 159, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:29:46,594 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1963476.0, ans=0.125 2023-06-28 03:30:38,751 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1963656.0, ans=0.125 2023-06-28 03:31:14,591 INFO [train.py:996] (0/4) Epoch 11, batch 22350, loss[loss=0.2345, simple_loss=0.2937, pruned_loss=0.08763, over 21801.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2931, pruned_loss=0.07103, over 4286872.33 frames. ], batch size: 441, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:31:27,258 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1963776.0, ans=0.125 2023-06-28 03:31:46,962 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1963836.0, ans=0.0 2023-06-28 03:31:52,805 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.36 vs. limit=10.0 2023-06-28 03:32:01,626 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.484e+02 6.278e+02 9.923e+02 1.351e+03 2.767e+03, threshold=1.985e+03, percent-clipped=14.0 2023-06-28 03:32:12,397 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1963896.0, ans=0.1 2023-06-28 03:32:22,282 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1963956.0, ans=0.0 2023-06-28 03:32:50,487 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1964016.0, ans=0.125 2023-06-28 03:32:58,058 INFO [train.py:996] (0/4) Epoch 11, batch 22400, loss[loss=0.1712, simple_loss=0.2655, pruned_loss=0.03843, over 21635.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2904, pruned_loss=0.06905, over 4287774.17 frames. 
], batch size: 263, lr: 2.62e-03, grad_scale: 32.0 2023-06-28 03:33:08,810 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1964076.0, ans=0.2 2023-06-28 03:33:38,555 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1964196.0, ans=0.125 2023-06-28 03:33:54,981 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1964196.0, ans=0.05 2023-06-28 03:34:03,329 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1964256.0, ans=0.1 2023-06-28 03:34:40,474 INFO [train.py:996] (0/4) Epoch 11, batch 22450, loss[loss=0.1747, simple_loss=0.2428, pruned_loss=0.05332, over 21615.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.284, pruned_loss=0.06741, over 4286881.81 frames. ], batch size: 231, lr: 2.62e-03, grad_scale: 8.0 2023-06-28 03:34:55,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1964376.0, ans=0.0 2023-06-28 03:35:05,961 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1964436.0, ans=0.0 2023-06-28 03:35:35,826 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.371e+02 5.963e+02 8.267e+02 1.246e+03 2.225e+03, threshold=1.653e+03, percent-clipped=2.0 2023-06-28 03:36:14,802 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.03 vs. limit=15.0 2023-06-28 03:36:24,032 INFO [train.py:996] (0/4) Epoch 11, batch 22500, loss[loss=0.2754, simple_loss=0.3471, pruned_loss=0.1018, over 21398.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2797, pruned_loss=0.06737, over 4283700.65 frames. ], batch size: 507, lr: 2.62e-03, grad_scale: 8.0 2023-06-28 03:36:49,326 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1964736.0, ans=0.125 2023-06-28 03:36:54,431 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1964736.0, ans=0.125 2023-06-28 03:37:10,880 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1964796.0, ans=0.2 2023-06-28 03:38:07,195 INFO [train.py:996] (0/4) Epoch 11, batch 22550, loss[loss=0.2077, simple_loss=0.2902, pruned_loss=0.06263, over 21841.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2827, pruned_loss=0.06785, over 4288940.09 frames. 
], batch size: 298, lr: 2.62e-03, grad_scale: 8.0 2023-06-28 03:38:07,930 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1964976.0, ans=0.0 2023-06-28 03:38:40,419 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1965036.0, ans=0.1 2023-06-28 03:39:03,638 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.783e+02 6.887e+02 1.011e+03 1.935e+03 4.167e+03, threshold=2.022e+03, percent-clipped=31.0 2023-06-28 03:39:22,149 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 03:39:25,638 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1965156.0, ans=0.125 2023-06-28 03:39:51,926 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1965216.0, ans=0.2 2023-06-28 03:39:56,246 INFO [train.py:996] (0/4) Epoch 11, batch 22600, loss[loss=0.2073, simple_loss=0.3055, pruned_loss=0.05456, over 20094.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2867, pruned_loss=0.06828, over 4289372.00 frames. ], batch size: 703, lr: 2.62e-03, grad_scale: 8.0 2023-06-28 03:41:07,043 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=19.66 vs. limit=22.5 2023-06-28 03:41:29,429 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.61 vs. limit=15.0 2023-06-28 03:41:33,171 INFO [train.py:996] (0/4) Epoch 11, batch 22650, loss[loss=0.2006, simple_loss=0.2582, pruned_loss=0.07148, over 21103.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2841, pruned_loss=0.06814, over 4271745.22 frames. ], batch size: 159, lr: 2.62e-03, grad_scale: 8.0 2023-06-28 03:41:33,672 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1965576.0, ans=0.125 2023-06-28 03:41:40,268 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1965576.0, ans=0.0 2023-06-28 03:41:45,237 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1965576.0, ans=0.2 2023-06-28 03:42:01,172 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1965636.0, ans=0.0 2023-06-28 03:42:15,118 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.84 vs. limit=15.0 2023-06-28 03:42:26,674 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.890e+02 8.413e+02 1.340e+03 1.745e+03 3.098e+03, threshold=2.679e+03, percent-clipped=14.0 2023-06-28 03:42:27,021 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1965696.0, ans=0.125 2023-06-28 03:43:09,775 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1965816.0, ans=0.0 2023-06-28 03:43:14,258 INFO [train.py:996] (0/4) Epoch 11, batch 22700, loss[loss=0.1839, simple_loss=0.2477, pruned_loss=0.06003, over 21597.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.2776, pruned_loss=0.0667, over 4271592.73 frames. 
], batch size: 247, lr: 2.62e-03, grad_scale: 8.0 2023-06-28 03:43:28,413 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1965876.0, ans=0.2 2023-06-28 03:43:46,724 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1965936.0, ans=0.125 2023-06-28 03:43:53,398 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1965996.0, ans=0.125 2023-06-28 03:44:27,799 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1966056.0, ans=0.0 2023-06-28 03:44:56,510 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.39 vs. limit=15.0 2023-06-28 03:44:56,793 INFO [train.py:996] (0/4) Epoch 11, batch 22750, loss[loss=0.2457, simple_loss=0.3056, pruned_loss=0.09291, over 21479.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2785, pruned_loss=0.06808, over 4275042.88 frames. ], batch size: 194, lr: 2.62e-03, grad_scale: 8.0 2023-06-28 03:45:07,375 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1966176.0, ans=0.1 2023-06-28 03:45:15,536 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1966236.0, ans=0.125 2023-06-28 03:45:27,022 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1966236.0, ans=0.1 2023-06-28 03:45:32,029 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 03:45:55,453 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.898e+02 9.181e+02 1.363e+03 2.029e+03 5.534e+03, threshold=2.727e+03, percent-clipped=14.0 2023-06-28 03:46:38,638 INFO [train.py:996] (0/4) Epoch 11, batch 22800, loss[loss=0.2101, simple_loss=0.2799, pruned_loss=0.07015, over 21703.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2827, pruned_loss=0.07011, over 4284868.27 frames. ], batch size: 391, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:46:44,137 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1966476.0, ans=0.125 2023-06-28 03:46:47,338 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1966476.0, ans=0.1 2023-06-28 03:46:54,106 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1966536.0, ans=0.125 2023-06-28 03:47:41,461 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1966656.0, ans=0.125 2023-06-28 03:48:01,467 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1966716.0, ans=0.125 2023-06-28 03:48:17,671 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 03:48:20,560 INFO [train.py:996] (0/4) Epoch 11, batch 22850, loss[loss=0.1943, simple_loss=0.2631, pruned_loss=0.06273, over 21656.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2795, pruned_loss=0.06922, over 4273888.05 frames. 
], batch size: 332, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:48:29,108 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 03:48:42,487 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1966836.0, ans=0.125 2023-06-28 03:48:54,454 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1966836.0, ans=0.125 2023-06-28 03:49:19,986 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.398e+02 6.821e+02 8.997e+02 1.443e+03 3.960e+03, threshold=1.799e+03, percent-clipped=4.0 2023-06-28 03:49:23,962 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1966956.0, ans=0.2 2023-06-28 03:50:04,228 INFO [train.py:996] (0/4) Epoch 11, batch 22900, loss[loss=0.1868, simple_loss=0.2594, pruned_loss=0.05705, over 21098.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2822, pruned_loss=0.06904, over 4278421.90 frames. ], batch size: 143, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:50:10,807 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.36 vs. limit=15.0 2023-06-28 03:50:25,529 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1967136.0, ans=0.05 2023-06-28 03:50:45,578 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1967136.0, ans=0.0 2023-06-28 03:50:51,947 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1967196.0, ans=0.0 2023-06-28 03:51:34,174 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1967316.0, ans=0.125 2023-06-28 03:51:48,482 INFO [train.py:996] (0/4) Epoch 11, batch 22950, loss[loss=0.1963, simple_loss=0.3159, pruned_loss=0.03841, over 21756.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2932, pruned_loss=0.0675, over 4277286.84 frames. ], batch size: 282, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:52:03,692 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1967376.0, ans=0.1 2023-06-28 03:52:22,893 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1967436.0, ans=0.125 2023-06-28 03:52:27,822 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1967436.0, ans=0.125 2023-06-28 03:52:42,079 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.491e+02 7.317e+02 1.405e+03 2.219e+03 4.116e+03, threshold=2.810e+03, percent-clipped=42.0 2023-06-28 03:52:56,428 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.27 vs. 
limit=12.0 2023-06-28 03:52:57,758 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1967556.0, ans=0.04949747468305833 2023-06-28 03:53:07,631 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1967616.0, ans=0.1 2023-06-28 03:53:20,007 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.67 vs. limit=15.0 2023-06-28 03:53:24,537 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1967676.0, ans=0.125 2023-06-28 03:53:25,447 INFO [train.py:996] (0/4) Epoch 11, batch 23000, loss[loss=0.2131, simple_loss=0.2875, pruned_loss=0.06933, over 21917.00 frames. ], tot_loss[loss=0.212, simple_loss=0.292, pruned_loss=0.06601, over 4281906.18 frames. ], batch size: 333, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:54:03,008 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1967736.0, ans=0.125 2023-06-28 03:54:29,768 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1967856.0, ans=0.125 2023-06-28 03:55:11,923 INFO [train.py:996] (0/4) Epoch 11, batch 23050, loss[loss=0.2502, simple_loss=0.3228, pruned_loss=0.08881, over 21583.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2941, pruned_loss=0.06739, over 4278561.60 frames. ], batch size: 415, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:55:12,628 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1967976.0, ans=0.2 2023-06-28 03:55:21,570 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-328000.pt 2023-06-28 03:55:30,124 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1967976.0, ans=0.2 2023-06-28 03:55:30,145 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1967976.0, ans=0.125 2023-06-28 03:55:33,415 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 03:55:40,033 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1968036.0, ans=0.125 2023-06-28 03:55:44,908 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1968036.0, ans=0.0 2023-06-28 03:55:54,875 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1968096.0, ans=0.125 2023-06-28 03:56:02,461 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.806e+02 7.903e+02 1.210e+03 1.646e+03 4.576e+03, threshold=2.420e+03, percent-clipped=5.0 2023-06-28 03:56:14,170 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1968156.0, ans=0.0 2023-06-28 03:56:24,420 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1968156.0, ans=0.2 2023-06-28 03:56:54,615 INFO [train.py:996] (0/4) Epoch 11, batch 23100, loss[loss=0.2046, simple_loss=0.2697, pruned_loss=0.06976, over 15685.00 frames. 
], tot_loss[loss=0.2123, simple_loss=0.2898, pruned_loss=0.0674, over 4273288.88 frames. ], batch size: 60, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:57:14,862 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1968336.0, ans=0.125 2023-06-28 03:57:26,765 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1968336.0, ans=10.0 2023-06-28 03:58:36,206 INFO [train.py:996] (0/4) Epoch 11, batch 23150, loss[loss=0.2123, simple_loss=0.2796, pruned_loss=0.07247, over 21803.00 frames. ], tot_loss[loss=0.2091, simple_loss=0.2845, pruned_loss=0.06682, over 4276746.75 frames. ], batch size: 414, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:59:16,529 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1968696.0, ans=0.0 2023-06-28 03:59:20,956 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.155e+02 6.572e+02 9.609e+02 1.447e+03 3.666e+03, threshold=1.922e+03, percent-clipped=4.0 2023-06-28 03:59:26,860 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.56 vs. limit=15.0 2023-06-28 04:00:06,812 INFO [train.py:996] (0/4) Epoch 11, batch 23200, loss[loss=0.2233, simple_loss=0.3136, pruned_loss=0.06645, over 17581.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2847, pruned_loss=0.06772, over 4281232.78 frames. ], batch size: 60, lr: 2.62e-03, grad_scale: 32.0 2023-06-28 04:00:17,180 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1968876.0, ans=0.0 2023-06-28 04:00:17,620 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.85 vs. limit=22.5 2023-06-28 04:00:45,050 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1968996.0, ans=0.0 2023-06-28 04:00:45,660 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.75 vs. limit=15.0 2023-06-28 04:01:00,125 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1969056.0, ans=0.125 2023-06-28 04:01:38,094 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1969116.0, ans=0.2 2023-06-28 04:01:48,926 INFO [train.py:996] (0/4) Epoch 11, batch 23250, loss[loss=0.209, simple_loss=0.2884, pruned_loss=0.06479, over 19911.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.2842, pruned_loss=0.06778, over 4285221.01 frames. ], batch size: 702, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:02:03,407 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1969176.0, ans=0.0 2023-06-28 04:02:24,894 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.83 vs. 
limit=15.0 2023-06-28 04:02:41,702 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 04:02:42,554 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.956e+02 7.376e+02 1.130e+03 1.714e+03 3.374e+03, threshold=2.260e+03, percent-clipped=21.0 2023-06-28 04:03:05,285 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1969356.0, ans=0.0 2023-06-28 04:03:33,446 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1969476.0, ans=0.125 2023-06-28 04:03:34,409 INFO [train.py:996] (0/4) Epoch 11, batch 23300, loss[loss=0.2493, simple_loss=0.3623, pruned_loss=0.06819, over 21747.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2894, pruned_loss=0.06908, over 4283643.41 frames. ], batch size: 332, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:03:50,479 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1969536.0, ans=0.125 2023-06-28 04:04:12,347 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1969596.0, ans=0.0 2023-06-28 04:04:17,538 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1969596.0, ans=0.1 2023-06-28 04:04:19,298 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1969596.0, ans=0.2 2023-06-28 04:04:41,295 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1969656.0, ans=0.125 2023-06-28 04:05:13,961 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 04:05:18,314 INFO [train.py:996] (0/4) Epoch 11, batch 23350, loss[loss=0.1565, simple_loss=0.2482, pruned_loss=0.03236, over 21828.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2935, pruned_loss=0.0686, over 4281385.23 frames. ], batch size: 317, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:05:39,037 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1969836.0, ans=0.125 2023-06-28 04:06:01,606 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1969896.0, ans=0.0 2023-06-28 04:06:09,873 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1969896.0, ans=0.125 2023-06-28 04:06:14,290 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.812e+02 6.998e+02 1.084e+03 1.696e+03 4.677e+03, threshold=2.169e+03, percent-clipped=9.0 2023-06-28 04:07:00,208 INFO [train.py:996] (0/4) Epoch 11, batch 23400, loss[loss=0.1592, simple_loss=0.2163, pruned_loss=0.05108, over 20028.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.2862, pruned_loss=0.06549, over 4267742.54 frames. 
], batch size: 704, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:07:26,501 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1970136.0, ans=6.0 2023-06-28 04:07:39,007 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1970196.0, ans=0.1 2023-06-28 04:08:14,789 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1970256.0, ans=0.1 2023-06-28 04:08:14,833 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1970256.0, ans=0.1 2023-06-28 04:08:26,357 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1970316.0, ans=0.125 2023-06-28 04:08:42,810 INFO [train.py:996] (0/4) Epoch 11, batch 23450, loss[loss=0.2824, simple_loss=0.3617, pruned_loss=0.1015, over 21817.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2872, pruned_loss=0.06602, over 4269352.14 frames. ], batch size: 124, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:08:45,297 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1970376.0, ans=0.125 2023-06-28 04:08:46,877 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1970376.0, ans=0.2 2023-06-28 04:08:55,869 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1970376.0, ans=0.0 2023-06-28 04:09:37,082 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.67 vs. limit=22.5 2023-06-28 04:09:39,102 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.169e+02 9.066e+02 1.305e+03 2.110e+03 3.921e+03, threshold=2.611e+03, percent-clipped=24.0 2023-06-28 04:09:49,417 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1970556.0, ans=0.1 2023-06-28 04:10:06,376 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1970616.0, ans=0.1 2023-06-28 04:10:20,269 INFO [train.py:996] (0/4) Epoch 11, batch 23500, loss[loss=0.2163, simple_loss=0.2896, pruned_loss=0.07147, over 21894.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2897, pruned_loss=0.06811, over 4273409.64 frames. 
], batch size: 371, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:10:58,733 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1970736.0, ans=0.2 2023-06-28 04:11:05,296 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1970796.0, ans=0.125 2023-06-28 04:11:05,405 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1970796.0, ans=0.0 2023-06-28 04:11:48,977 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1970916.0, ans=0.125 2023-06-28 04:11:55,931 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1970976.0, ans=0.1 2023-06-28 04:11:56,888 INFO [train.py:996] (0/4) Epoch 11, batch 23550, loss[loss=0.206, simple_loss=0.2609, pruned_loss=0.07551, over 21974.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2837, pruned_loss=0.06749, over 4276320.30 frames. ], batch size: 375, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:12:04,077 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1970976.0, ans=0.125 2023-06-28 04:12:55,945 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1971096.0, ans=0.1 2023-06-28 04:12:56,969 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.515e+02 7.071e+02 9.804e+02 1.415e+03 2.782e+03, threshold=1.961e+03, percent-clipped=2.0 2023-06-28 04:13:19,574 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=1971216.0, ans=0.05 2023-06-28 04:13:21,735 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.16 vs. limit=6.0 2023-06-28 04:13:33,841 INFO [train.py:996] (0/4) Epoch 11, batch 23600, loss[loss=0.2238, simple_loss=0.3015, pruned_loss=0.07303, over 21509.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.2845, pruned_loss=0.06768, over 4276506.91 frames. ], batch size: 194, lr: 2.62e-03, grad_scale: 32.0 2023-06-28 04:14:29,582 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1971396.0, ans=0.0 2023-06-28 04:14:35,921 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1971396.0, ans=0.0 2023-06-28 04:14:54,214 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 04:14:57,708 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1971456.0, ans=0.125 2023-06-28 04:15:22,096 INFO [train.py:996] (0/4) Epoch 11, batch 23650, loss[loss=0.2924, simple_loss=0.3537, pruned_loss=0.1155, over 21380.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2847, pruned_loss=0.06668, over 4276617.64 frames. 
], batch size: 507, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:16:09,746 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1971696.0, ans=0.125 2023-06-28 04:16:14,962 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1971696.0, ans=0.125 2023-06-28 04:16:25,960 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.038e+02 7.663e+02 1.286e+03 2.404e+03 4.690e+03, threshold=2.571e+03, percent-clipped=33.0 2023-06-28 04:16:53,741 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1971816.0, ans=0.0 2023-06-28 04:17:11,306 INFO [train.py:996] (0/4) Epoch 11, batch 23700, loss[loss=0.1829, simple_loss=0.2661, pruned_loss=0.04986, over 21608.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2855, pruned_loss=0.06607, over 4272112.77 frames. ], batch size: 263, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:17:14,326 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.48 vs. limit=6.0 2023-06-28 04:17:23,776 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1971876.0, ans=0.0 2023-06-28 04:18:36,599 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.94 vs. limit=15.0 2023-06-28 04:18:55,637 INFO [train.py:996] (0/4) Epoch 11, batch 23750, loss[loss=0.2155, simple_loss=0.3112, pruned_loss=0.05994, over 21463.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2889, pruned_loss=0.06701, over 4280229.21 frames. ], batch size: 471, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:19:10,160 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1972176.0, ans=0.125 2023-06-28 04:19:16,914 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1972176.0, ans=0.125 2023-06-28 04:19:37,719 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1972296.0, ans=0.1 2023-06-28 04:19:41,254 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1972296.0, ans=0.1 2023-06-28 04:19:56,754 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1972296.0, ans=0.125 2023-06-28 04:19:59,569 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.432e+02 7.586e+02 1.231e+03 1.988e+03 4.114e+03, threshold=2.463e+03, percent-clipped=17.0 2023-06-28 04:20:49,619 INFO [train.py:996] (0/4) Epoch 11, batch 23800, loss[loss=0.2198, simple_loss=0.3035, pruned_loss=0.06807, over 21229.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2892, pruned_loss=0.06629, over 4275959.78 frames. ], batch size: 159, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:21:32,980 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.12 vs. 
limit=15.0 2023-06-28 04:21:39,343 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1972596.0, ans=0.125 2023-06-28 04:21:55,272 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.56 vs. limit=10.0 2023-06-28 04:22:30,795 INFO [train.py:996] (0/4) Epoch 11, batch 23850, loss[loss=0.2266, simple_loss=0.312, pruned_loss=0.07054, over 21956.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2971, pruned_loss=0.06816, over 4277788.52 frames. ], batch size: 317, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:23:30,394 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.838e+02 1.015e+03 1.727e+03 2.965e+03 4.931e+03, threshold=3.454e+03, percent-clipped=27.0 2023-06-28 04:23:34,345 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1972956.0, ans=0.025 2023-06-28 04:24:14,061 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1973076.0, ans=0.1 2023-06-28 04:24:14,930 INFO [train.py:996] (0/4) Epoch 11, batch 23900, loss[loss=0.1883, simple_loss=0.2741, pruned_loss=0.05129, over 20752.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.3026, pruned_loss=0.06923, over 4282573.79 frames. ], batch size: 607, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:24:40,024 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1973136.0, ans=0.0 2023-06-28 04:24:51,744 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1973196.0, ans=0.1 2023-06-28 04:25:42,846 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 04:25:57,212 INFO [train.py:996] (0/4) Epoch 11, batch 23950, loss[loss=0.2148, simple_loss=0.2862, pruned_loss=0.07172, over 21358.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.3004, pruned_loss=0.06895, over 4272598.82 frames. ], batch size: 211, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:26:26,543 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1973436.0, ans=0.125 2023-06-28 04:26:37,515 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.97 vs. limit=15.0 2023-06-28 04:27:01,133 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.639e+02 7.884e+02 1.240e+03 1.758e+03 3.648e+03, threshold=2.481e+03, percent-clipped=1.0 2023-06-28 04:27:08,925 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.25 vs. 
limit=15.0 2023-06-28 04:27:14,829 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1973556.0, ans=0.125 2023-06-28 04:27:16,427 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1973556.0, ans=0.0 2023-06-28 04:27:18,138 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1973556.0, ans=0.0 2023-06-28 04:27:27,924 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 04:27:40,596 INFO [train.py:996] (0/4) Epoch 11, batch 24000, loss[loss=0.2278, simple_loss=0.3001, pruned_loss=0.07778, over 21998.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.3003, pruned_loss=0.07094, over 4276939.87 frames. ], batch size: 317, lr: 2.62e-03, grad_scale: 32.0 2023-06-28 04:27:40,597 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-28 04:28:01,236 INFO [train.py:1028] (0/4) Epoch 11, validation: loss=0.2606, simple_loss=0.3539, pruned_loss=0.08365, over 1796401.00 frames. 2023-06-28 04:28:01,237 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23714MB 2023-06-28 04:29:06,062 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1973856.0, ans=0.125 2023-06-28 04:29:24,737 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1973916.0, ans=0.1 2023-06-28 04:29:45,821 INFO [train.py:996] (0/4) Epoch 11, batch 24050, loss[loss=0.2665, simple_loss=0.3401, pruned_loss=0.09642, over 21446.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.3029, pruned_loss=0.07238, over 4279745.11 frames. ], batch size: 508, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:30:09,182 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.71 vs. limit=22.5 2023-06-28 04:30:24,417 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1974036.0, ans=0.1 2023-06-28 04:30:30,793 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1974096.0, ans=0.125 2023-06-28 04:30:41,098 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.71 vs. limit=12.0 2023-06-28 04:30:50,228 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.159e+02 7.180e+02 1.052e+03 1.636e+03 2.739e+03, threshold=2.104e+03, percent-clipped=1.0 2023-06-28 04:30:51,704 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.65 vs. 
limit=15.0 2023-06-28 04:31:09,870 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1974216.0, ans=0.125 2023-06-28 04:31:11,424 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1974216.0, ans=0.0 2023-06-28 04:31:13,182 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1974216.0, ans=0.1 2023-06-28 04:31:19,505 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1974216.0, ans=0.1 2023-06-28 04:31:33,819 INFO [train.py:996] (0/4) Epoch 11, batch 24100, loss[loss=0.2236, simple_loss=0.3102, pruned_loss=0.06849, over 21859.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.3027, pruned_loss=0.07128, over 4275290.02 frames. ], batch size: 282, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:32:03,916 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.18 vs. limit=15.0 2023-06-28 04:32:19,476 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1974396.0, ans=0.125 2023-06-28 04:32:51,173 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1974516.0, ans=0.2 2023-06-28 04:33:14,909 INFO [train.py:996] (0/4) Epoch 11, batch 24150, loss[loss=0.2357, simple_loss=0.3168, pruned_loss=0.07729, over 21871.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3035, pruned_loss=0.07262, over 4284904.56 frames. ], batch size: 107, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:33:15,546 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1974576.0, ans=0.125 2023-06-28 04:33:19,064 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1974576.0, ans=0.1 2023-06-28 04:33:53,355 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1974636.0, ans=0.09899494936611666 2023-06-28 04:34:14,498 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.271e+02 8.001e+02 1.203e+03 1.842e+03 3.600e+03, threshold=2.405e+03, percent-clipped=13.0 2023-06-28 04:34:19,326 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.99 vs. limit=6.0 2023-06-28 04:34:20,402 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1974756.0, ans=0.0 2023-06-28 04:34:39,105 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1974816.0, ans=0.0 2023-06-28 04:34:49,062 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1974816.0, ans=0.0 2023-06-28 04:34:58,328 INFO [train.py:996] (0/4) Epoch 11, batch 24200, loss[loss=0.2786, simple_loss=0.3585, pruned_loss=0.09932, over 21610.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.3058, pruned_loss=0.07423, over 4286148.28 frames. 
], batch size: 441, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:35:17,625 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1974876.0, ans=0.1 2023-06-28 04:35:26,255 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.89 vs. limit=15.0 2023-06-28 04:36:47,566 INFO [train.py:996] (0/4) Epoch 11, batch 24250, loss[loss=0.1824, simple_loss=0.272, pruned_loss=0.04643, over 21371.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.3032, pruned_loss=0.06928, over 4289453.97 frames. ], batch size: 211, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:36:59,758 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.64 vs. limit=6.0 2023-06-28 04:37:21,978 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1975236.0, ans=0.125 2023-06-28 04:37:48,142 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.666e+02 6.261e+02 9.348e+02 1.527e+03 2.867e+03, threshold=1.870e+03, percent-clipped=6.0 2023-06-28 04:38:07,066 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1975356.0, ans=0.125 2023-06-28 04:38:35,070 INFO [train.py:996] (0/4) Epoch 11, batch 24300, loss[loss=0.1537, simple_loss=0.2332, pruned_loss=0.03704, over 21521.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2979, pruned_loss=0.06416, over 4287323.02 frames. ], batch size: 211, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:38:39,436 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.03 vs. limit=15.0 2023-06-28 04:38:42,116 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1975476.0, ans=0.0 2023-06-28 04:39:49,175 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.03 vs. limit=22.5 2023-06-28 04:40:12,593 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1975716.0, ans=0.1 2023-06-28 04:40:16,281 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.81 vs. limit=15.0 2023-06-28 04:40:16,734 INFO [train.py:996] (0/4) Epoch 11, batch 24350, loss[loss=0.1601, simple_loss=0.2393, pruned_loss=0.04042, over 21634.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2946, pruned_loss=0.06388, over 4293097.13 frames. 
], batch size: 230, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:41:16,492 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.173e+02 7.216e+02 1.198e+03 1.667e+03 3.137e+03, threshold=2.397e+03, percent-clipped=16.0 2023-06-28 04:41:50,283 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1976016.0, ans=0.125 2023-06-28 04:41:53,546 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1976016.0, ans=0.125 2023-06-28 04:41:58,606 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1976076.0, ans=0.0 2023-06-28 04:41:59,547 INFO [train.py:996] (0/4) Epoch 11, batch 24400, loss[loss=0.1991, simple_loss=0.2903, pruned_loss=0.05402, over 21816.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2963, pruned_loss=0.06633, over 4293564.01 frames. ], batch size: 282, lr: 2.62e-03, grad_scale: 32.0 2023-06-28 04:43:21,542 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1976256.0, ans=0.125 2023-06-28 04:43:42,610 INFO [train.py:996] (0/4) Epoch 11, batch 24450, loss[loss=0.2096, simple_loss=0.2869, pruned_loss=0.06613, over 21244.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2959, pruned_loss=0.06731, over 4282856.41 frames. ], batch size: 159, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 04:44:30,689 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1976496.0, ans=0.125 2023-06-28 04:44:48,230 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.546e+02 6.657e+02 8.727e+02 1.270e+03 2.887e+03, threshold=1.745e+03, percent-clipped=2.0 2023-06-28 04:44:50,538 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1976556.0, ans=0.0 2023-06-28 04:45:24,281 INFO [train.py:996] (0/4) Epoch 11, batch 24500, loss[loss=0.2354, simple_loss=0.3026, pruned_loss=0.08415, over 21618.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2959, pruned_loss=0.06736, over 4289196.67 frames. ], batch size: 471, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 04:45:29,735 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1976676.0, ans=0.125 2023-06-28 04:46:56,658 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.06 vs. limit=6.0 2023-06-28 04:47:07,087 INFO [train.py:996] (0/4) Epoch 11, batch 24550, loss[loss=0.2389, simple_loss=0.3226, pruned_loss=0.07758, over 21570.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2988, pruned_loss=0.06898, over 4284491.02 frames. 
], batch size: 414, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 04:47:29,508 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1977036.0, ans=0.125 2023-06-28 04:48:18,393 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.753e+02 7.977e+02 1.391e+03 1.923e+03 3.873e+03, threshold=2.782e+03, percent-clipped=31.0 2023-06-28 04:48:27,092 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1977156.0, ans=0.2 2023-06-28 04:48:42,249 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1977216.0, ans=0.125 2023-06-28 04:48:44,423 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.85 vs. limit=15.0 2023-06-28 04:48:54,435 INFO [train.py:996] (0/4) Epoch 11, batch 24600, loss[loss=0.196, simple_loss=0.253, pruned_loss=0.06953, over 21258.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2969, pruned_loss=0.07003, over 4277204.03 frames. ], batch size: 176, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 04:49:17,859 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=1977336.0, ans=0.025 2023-06-28 04:49:21,221 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1977336.0, ans=0.125 2023-06-28 04:50:05,838 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1977456.0, ans=0.125 2023-06-28 04:50:11,067 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1977456.0, ans=0.0 2023-06-28 04:50:21,133 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1977516.0, ans=0.0 2023-06-28 04:50:26,626 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=1977516.0, ans=22.5 2023-06-28 04:50:37,072 INFO [train.py:996] (0/4) Epoch 11, batch 24650, loss[loss=0.1744, simple_loss=0.2361, pruned_loss=0.05632, over 21321.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2913, pruned_loss=0.06864, over 4265105.13 frames. ], batch size: 144, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 04:50:42,948 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1977576.0, ans=0.125 2023-06-28 04:51:07,570 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1977636.0, ans=0.0 2023-06-28 04:51:34,022 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.54 vs. limit=22.5 2023-06-28 04:51:42,517 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.777e+02 8.360e+02 1.097e+03 1.550e+03 2.969e+03, threshold=2.194e+03, percent-clipped=2.0 2023-06-28 04:52:19,285 INFO [train.py:996] (0/4) Epoch 11, batch 24700, loss[loss=0.1825, simple_loss=0.2579, pruned_loss=0.05354, over 21381.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2876, pruned_loss=0.06678, over 4263236.38 frames. 
], batch size: 211, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 04:54:01,985 INFO [train.py:996] (0/4) Epoch 11, batch 24750, loss[loss=0.1721, simple_loss=0.2296, pruned_loss=0.05729, over 20725.00 frames. ], tot_loss[loss=0.2045, simple_loss=0.2802, pruned_loss=0.06439, over 4265710.72 frames. ], batch size: 607, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 04:55:07,374 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.038e+02 5.891e+02 8.003e+02 1.099e+03 2.127e+03, threshold=1.601e+03, percent-clipped=0.0 2023-06-28 04:55:18,755 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.26 vs. limit=15.0 2023-06-28 04:55:21,844 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.97 vs. limit=15.0 2023-06-28 04:55:38,487 INFO [train.py:996] (0/4) Epoch 11, batch 24800, loss[loss=0.1926, simple_loss=0.2464, pruned_loss=0.06939, over 20070.00 frames. ], tot_loss[loss=0.2025, simple_loss=0.2759, pruned_loss=0.06454, over 4270854.59 frames. ], batch size: 703, lr: 2.61e-03, grad_scale: 32.0 2023-06-28 04:57:17,839 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1978716.0, ans=0.125 2023-06-28 04:57:22,247 INFO [train.py:996] (0/4) Epoch 11, batch 24850, loss[loss=0.1782, simple_loss=0.2514, pruned_loss=0.05248, over 21548.00 frames. ], tot_loss[loss=0.2036, simple_loss=0.2755, pruned_loss=0.06587, over 4279118.39 frames. ], batch size: 230, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 04:58:20,576 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1978896.0, ans=0.2 2023-06-28 04:58:24,122 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1978896.0, ans=0.125 2023-06-28 04:58:29,236 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1978896.0, ans=0.0 2023-06-28 04:58:35,357 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.508e+02 8.527e+02 1.164e+03 1.873e+03 3.084e+03, threshold=2.328e+03, percent-clipped=28.0 2023-06-28 04:58:41,056 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1978956.0, ans=0.09899494936611666 2023-06-28 04:58:59,228 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1979016.0, ans=0.0 2023-06-28 04:59:09,790 INFO [train.py:996] (0/4) Epoch 11, batch 24900, loss[loss=0.2269, simple_loss=0.2915, pruned_loss=0.08118, over 21274.00 frames. ], tot_loss[loss=0.2056, simple_loss=0.278, pruned_loss=0.06663, over 4284003.65 frames. 
], batch size: 143, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 04:59:35,412 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1979136.0, ans=0.2 2023-06-28 04:59:54,533 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1979136.0, ans=0.125 2023-06-28 05:00:09,791 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1979196.0, ans=0.1 2023-06-28 05:00:13,211 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1979256.0, ans=0.1 2023-06-28 05:00:27,212 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1979256.0, ans=0.125 2023-06-28 05:00:32,227 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1979316.0, ans=0.0 2023-06-28 05:00:37,332 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1979316.0, ans=0.125 2023-06-28 05:00:58,720 INFO [train.py:996] (0/4) Epoch 11, batch 24950, loss[loss=0.2969, simple_loss=0.3556, pruned_loss=0.1191, over 21391.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2855, pruned_loss=0.07005, over 4285204.16 frames. ], batch size: 471, lr: 2.61e-03, grad_scale: 8.0 2023-06-28 05:01:43,084 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1979496.0, ans=0.05 2023-06-28 05:01:48,324 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1979496.0, ans=0.125 2023-06-28 05:02:04,368 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.368e+02 8.687e+02 1.291e+03 2.049e+03 3.753e+03, threshold=2.582e+03, percent-clipped=19.0 2023-06-28 05:02:42,354 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=19.41 vs. limit=22.5 2023-06-28 05:02:42,797 INFO [train.py:996] (0/4) Epoch 11, batch 25000, loss[loss=0.1815, simple_loss=0.2482, pruned_loss=0.05738, over 21278.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2915, pruned_loss=0.07178, over 4278080.71 frames. 
], batch size: 176, lr: 2.61e-03, grad_scale: 8.0 2023-06-28 05:03:13,595 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1979736.0, ans=0.0 2023-06-28 05:03:21,824 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1979736.0, ans=0.0 2023-06-28 05:03:21,985 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1979736.0, ans=0.125 2023-06-28 05:03:43,760 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1979856.0, ans=0.2 2023-06-28 05:03:56,329 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1979856.0, ans=0.2 2023-06-28 05:04:02,122 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=1979856.0, ans=15.0 2023-06-28 05:04:21,363 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1979916.0, ans=0.125 2023-06-28 05:04:21,458 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1979916.0, ans=0.125 2023-06-28 05:04:25,858 INFO [train.py:996] (0/4) Epoch 11, batch 25050, loss[loss=0.2206, simple_loss=0.2806, pruned_loss=0.08027, over 21268.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2873, pruned_loss=0.07049, over 4273491.86 frames. ], batch size: 144, lr: 2.61e-03, grad_scale: 8.0 2023-06-28 05:04:54,556 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1980036.0, ans=0.0 2023-06-28 05:05:03,098 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1980036.0, ans=0.125 2023-06-28 05:05:10,135 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1980096.0, ans=0.125 2023-06-28 05:05:18,702 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1980096.0, ans=0.125 2023-06-28 05:05:30,984 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1980156.0, ans=0.125 2023-06-28 05:05:37,088 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.870e+02 6.206e+02 8.703e+02 1.312e+03 2.418e+03, threshold=1.741e+03, percent-clipped=0.0 2023-06-28 05:06:09,896 INFO [train.py:996] (0/4) Epoch 11, batch 25100, loss[loss=0.1849, simple_loss=0.2578, pruned_loss=0.05605, over 21839.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2821, pruned_loss=0.06937, over 4270126.78 frames. 
], batch size: 107, lr: 2.61e-03, grad_scale: 8.0 2023-06-28 05:06:10,456 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 05:06:56,073 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1980396.0, ans=0.125 2023-06-28 05:07:29,386 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1980516.0, ans=0.1 2023-06-28 05:07:51,376 INFO [train.py:996] (0/4) Epoch 11, batch 25150, loss[loss=0.1842, simple_loss=0.2778, pruned_loss=0.04523, over 21821.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.2828, pruned_loss=0.06723, over 4258093.84 frames. ], batch size: 332, lr: 2.61e-03, grad_scale: 8.0 2023-06-28 05:08:10,309 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1980576.0, ans=0.0 2023-06-28 05:08:55,624 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.446e+02 6.553e+02 1.065e+03 1.530e+03 2.529e+03, threshold=2.131e+03, percent-clipped=15.0 2023-06-28 05:09:21,211 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1980816.0, ans=0.2 2023-06-28 05:09:24,571 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1980816.0, ans=0.1 2023-06-28 05:09:28,757 INFO [train.py:996] (0/4) Epoch 11, batch 25200, loss[loss=0.1762, simple_loss=0.2628, pruned_loss=0.04482, over 21448.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.2826, pruned_loss=0.06518, over 4266036.68 frames. ], batch size: 131, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:10:15,207 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1980996.0, ans=0.0 2023-06-28 05:10:59,878 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1981116.0, ans=0.0 2023-06-28 05:11:01,387 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1981116.0, ans=0.2 2023-06-28 05:11:10,880 INFO [train.py:996] (0/4) Epoch 11, batch 25250, loss[loss=0.1831, simple_loss=0.2645, pruned_loss=0.05087, over 21772.00 frames. ], tot_loss[loss=0.2044, simple_loss=0.2812, pruned_loss=0.06383, over 4267845.28 frames. ], batch size: 371, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:11:19,495 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1981176.0, ans=0.07 2023-06-28 05:11:22,000 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.01 vs. limit=6.0 2023-06-28 05:12:21,393 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.599e+02 7.557e+02 1.172e+03 1.779e+03 3.738e+03, threshold=2.344e+03, percent-clipped=14.0 2023-06-28 05:12:38,451 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1981416.0, ans=0.0 2023-06-28 05:12:59,823 INFO [train.py:996] (0/4) Epoch 11, batch 25300, loss[loss=0.2131, simple_loss=0.299, pruned_loss=0.06364, over 21663.00 frames. ], tot_loss[loss=0.2029, simple_loss=0.2798, pruned_loss=0.063, over 4256911.11 frames. 
], batch size: 351, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:13:03,682 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 05:13:08,688 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1981476.0, ans=0.04949747468305833 2023-06-28 05:13:49,522 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1981596.0, ans=0.125 2023-06-28 05:13:58,393 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.08 vs. limit=15.0 2023-06-28 05:14:12,609 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1981656.0, ans=0.1 2023-06-28 05:14:44,496 INFO [train.py:996] (0/4) Epoch 11, batch 25350, loss[loss=0.1744, simple_loss=0.2577, pruned_loss=0.04552, over 21567.00 frames. ], tot_loss[loss=0.2025, simple_loss=0.2813, pruned_loss=0.06186, over 4261729.95 frames. ], batch size: 263, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:14:54,547 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1981776.0, ans=0.2 2023-06-28 05:14:58,604 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.60 vs. limit=6.0 2023-06-28 05:15:09,278 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1981836.0, ans=0.125 2023-06-28 05:15:53,137 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.507e+02 7.550e+02 1.200e+03 1.857e+03 4.350e+03, threshold=2.399e+03, percent-clipped=14.0 2023-06-28 05:15:56,136 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.34 vs. limit=8.0 2023-06-28 05:16:01,006 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.58 vs. limit=12.0 2023-06-28 05:16:01,035 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.30 vs. limit=10.0 2023-06-28 05:16:17,821 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1982016.0, ans=0.5 2023-06-28 05:16:25,269 INFO [train.py:996] (0/4) Epoch 11, batch 25400, loss[loss=0.1908, simple_loss=0.2564, pruned_loss=0.0626, over 21498.00 frames. ], tot_loss[loss=0.2002, simple_loss=0.2783, pruned_loss=0.06108, over 4261296.32 frames. ], batch size: 230, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:16:54,928 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1982136.0, ans=0.125 2023-06-28 05:16:56,563 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1982136.0, ans=0.0 2023-06-28 05:18:07,452 INFO [train.py:996] (0/4) Epoch 11, batch 25450, loss[loss=0.1842, simple_loss=0.2757, pruned_loss=0.04632, over 21708.00 frames. ], tot_loss[loss=0.2019, simple_loss=0.2793, pruned_loss=0.06228, over 4272386.35 frames. 
], batch size: 247, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:18:08,836 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.64 vs. limit=22.5 2023-06-28 05:18:24,154 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.81 vs. limit=10.0 2023-06-28 05:18:56,626 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1982496.0, ans=0.0 2023-06-28 05:19:05,722 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.28 vs. limit=6.0 2023-06-28 05:19:17,819 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.365e+02 6.795e+02 1.021e+03 1.795e+03 3.141e+03, threshold=2.041e+03, percent-clipped=7.0 2023-06-28 05:19:56,328 INFO [train.py:996] (0/4) Epoch 11, batch 25500, loss[loss=0.2091, simple_loss=0.2982, pruned_loss=0.06002, over 21777.00 frames. ], tot_loss[loss=0.1993, simple_loss=0.2787, pruned_loss=0.05995, over 4261756.64 frames. ], batch size: 351, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:21:05,428 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1982856.0, ans=0.2 2023-06-28 05:21:39,901 INFO [train.py:996] (0/4) Epoch 11, batch 25550, loss[loss=0.2344, simple_loss=0.3391, pruned_loss=0.06488, over 21334.00 frames. ], tot_loss[loss=0.2028, simple_loss=0.2851, pruned_loss=0.06024, over 4267327.92 frames. ], batch size: 548, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:22:09,007 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.45 vs. limit=22.5 2023-06-28 05:22:34,760 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 05:22:38,337 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1983156.0, ans=0.1 2023-06-28 05:22:39,899 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1983156.0, ans=0.125 2023-06-28 05:22:44,326 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.429e+02 7.335e+02 1.015e+03 1.599e+03 3.312e+03, threshold=2.031e+03, percent-clipped=14.0 2023-06-28 05:23:28,338 INFO [train.py:996] (0/4) Epoch 11, batch 25600, loss[loss=0.2048, simple_loss=0.2838, pruned_loss=0.06291, over 21817.00 frames. ], tot_loss[loss=0.2044, simple_loss=0.288, pruned_loss=0.06037, over 4255061.39 frames. ], batch size: 102, lr: 2.61e-03, grad_scale: 32.0 2023-06-28 05:24:13,904 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1983396.0, ans=0.125 2023-06-28 05:24:28,752 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1983456.0, ans=0.1 2023-06-28 05:24:41,794 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 05:24:48,731 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.34 vs. 
limit=15.0 2023-06-28 05:25:10,636 INFO [train.py:996] (0/4) Epoch 11, batch 25650, loss[loss=0.204, simple_loss=0.2774, pruned_loss=0.06524, over 21435.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.2892, pruned_loss=0.06332, over 4247225.66 frames. ], batch size: 131, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:25:36,418 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1983636.0, ans=0.2 2023-06-28 05:25:48,345 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.54 vs. limit=22.5 2023-06-28 05:25:56,540 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=1983696.0, ans=15.0 2023-06-28 05:25:57,859 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1983696.0, ans=0.0 2023-06-28 05:26:21,288 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.706e+02 6.784e+02 1.002e+03 1.536e+03 3.689e+03, threshold=2.004e+03, percent-clipped=11.0 2023-06-28 05:26:26,957 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1983756.0, ans=0.125 2023-06-28 05:26:29,154 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.43 vs. limit=22.5 2023-06-28 05:26:52,938 INFO [train.py:996] (0/4) Epoch 11, batch 25700, loss[loss=0.1861, simple_loss=0.2576, pruned_loss=0.05729, over 21752.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2855, pruned_loss=0.0644, over 4243717.17 frames. ], batch size: 282, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:27:10,091 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1983936.0, ans=0.0 2023-06-28 05:27:39,892 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1983996.0, ans=0.1 2023-06-28 05:28:32,205 INFO [train.py:996] (0/4) Epoch 11, batch 25750, loss[loss=0.2319, simple_loss=0.3039, pruned_loss=0.07992, over 21333.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2905, pruned_loss=0.06718, over 4258642.38 frames. ], batch size: 548, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:28:33,606 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.42 vs. 
limit=15.0 2023-06-28 05:28:39,840 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1984176.0, ans=0.2 2023-06-28 05:28:53,429 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1984236.0, ans=0.0 2023-06-28 05:29:46,251 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1984356.0, ans=0.1 2023-06-28 05:29:47,948 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1984356.0, ans=0.2 2023-06-28 05:29:50,526 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.299e+02 8.292e+02 1.215e+03 2.235e+03 4.745e+03, threshold=2.430e+03, percent-clipped=27.0 2023-06-28 05:30:07,589 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1984416.0, ans=0.1 2023-06-28 05:30:23,503 INFO [train.py:996] (0/4) Epoch 11, batch 25800, loss[loss=0.2776, simple_loss=0.3539, pruned_loss=0.1006, over 21755.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2993, pruned_loss=0.07066, over 4257656.94 frames. ], batch size: 441, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:31:48,338 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=1984716.0, ans=0.025 2023-06-28 05:31:57,257 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.69 vs. limit=15.0 2023-06-28 05:32:06,386 INFO [train.py:996] (0/4) Epoch 11, batch 25850, loss[loss=0.2245, simple_loss=0.2997, pruned_loss=0.07466, over 21822.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.302, pruned_loss=0.07091, over 4262697.59 frames. ], batch size: 118, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:32:33,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1984836.0, ans=0.1 2023-06-28 05:32:41,841 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1984836.0, ans=0.125 2023-06-28 05:33:02,633 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1984896.0, ans=0.125 2023-06-28 05:33:18,847 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.240e+02 7.750e+02 1.095e+03 1.413e+03 4.702e+03, threshold=2.190e+03, percent-clipped=3.0 2023-06-28 05:33:21,874 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.85 vs. limit=15.0 2023-06-28 05:33:45,979 INFO [train.py:996] (0/4) Epoch 11, batch 25900, loss[loss=0.2497, simple_loss=0.3415, pruned_loss=0.07892, over 21821.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.3032, pruned_loss=0.0715, over 4267323.45 frames. 
], batch size: 316, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:34:05,407 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1985076.0, ans=0.125 2023-06-28 05:34:13,269 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1985136.0, ans=0.0 2023-06-28 05:34:20,249 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1985136.0, ans=0.0 2023-06-28 05:34:47,039 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 05:35:07,132 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1985316.0, ans=0.125 2023-06-28 05:35:07,638 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.75 vs. limit=15.0 2023-06-28 05:35:16,956 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1985316.0, ans=0.015 2023-06-28 05:35:18,739 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1985316.0, ans=0.125 2023-06-28 05:35:19,534 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.37 vs. limit=6.0 2023-06-28 05:35:29,667 INFO [train.py:996] (0/4) Epoch 11, batch 25950, loss[loss=0.2364, simple_loss=0.3236, pruned_loss=0.0746, over 21788.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3126, pruned_loss=0.07568, over 4276554.03 frames. ], batch size: 124, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:35:51,504 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1985376.0, ans=0.1 2023-06-28 05:36:41,748 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.536e+02 7.393e+02 8.893e+02 1.407e+03 4.224e+03, threshold=1.779e+03, percent-clipped=8.0 2023-06-28 05:37:01,237 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1985616.0, ans=0.2 2023-06-28 05:37:18,842 INFO [train.py:996] (0/4) Epoch 11, batch 26000, loss[loss=0.2335, simple_loss=0.3183, pruned_loss=0.0744, over 21714.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.312, pruned_loss=0.07374, over 4277554.20 frames. ], batch size: 298, lr: 2.61e-03, grad_scale: 32.0 2023-06-28 05:37:30,261 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=1985676.0, ans=22.5 2023-06-28 05:37:34,512 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1985676.0, ans=0.0 2023-06-28 05:37:54,141 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1985736.0, ans=0.125 2023-06-28 05:38:11,692 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.17 vs. 
limit=22.5 2023-06-28 05:38:45,682 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 05:38:53,768 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1985916.0, ans=0.125 2023-06-28 05:39:00,975 INFO [train.py:996] (0/4) Epoch 11, batch 26050, loss[loss=0.2016, simple_loss=0.2729, pruned_loss=0.06512, over 21860.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3117, pruned_loss=0.07422, over 4271715.43 frames. ], batch size: 282, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:39:55,502 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1986156.0, ans=0.5 2023-06-28 05:40:03,333 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.740e+02 6.927e+02 9.303e+02 1.315e+03 2.564e+03, threshold=1.861e+03, percent-clipped=11.0 2023-06-28 05:40:37,545 INFO [train.py:996] (0/4) Epoch 11, batch 26100, loss[loss=0.2181, simple_loss=0.2795, pruned_loss=0.0784, over 21619.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3064, pruned_loss=0.07381, over 4281346.48 frames. ], batch size: 195, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:41:01,199 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1986336.0, ans=0.2 2023-06-28 05:41:16,583 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1986336.0, ans=0.2 2023-06-28 05:41:34,249 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.69 vs. limit=15.0 2023-06-28 05:42:00,345 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.42 vs. limit=10.0 2023-06-28 05:42:10,106 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1986516.0, ans=0.125 2023-06-28 05:42:22,247 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.27 vs. limit=6.0 2023-06-28 05:42:25,872 INFO [train.py:996] (0/4) Epoch 11, batch 26150, loss[loss=0.2385, simple_loss=0.3097, pruned_loss=0.08366, over 21366.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3026, pruned_loss=0.07313, over 4287322.98 frames. ], batch size: 548, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:42:54,575 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.61 vs. limit=15.0 2023-06-28 05:43:40,783 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.470e+02 6.839e+02 9.051e+02 1.314e+03 2.834e+03, threshold=1.810e+03, percent-clipped=6.0 2023-06-28 05:44:10,838 INFO [train.py:996] (0/4) Epoch 11, batch 26200, loss[loss=0.2289, simple_loss=0.3359, pruned_loss=0.06091, over 21778.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.3034, pruned_loss=0.07116, over 4287590.95 frames. ], batch size: 351, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:44:39,019 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.44 vs. 
limit=22.5 2023-06-28 05:44:54,938 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1986996.0, ans=0.125 2023-06-28 05:45:17,002 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1987056.0, ans=0.2 2023-06-28 05:45:21,904 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1987056.0, ans=0.125 2023-06-28 05:45:49,438 INFO [train.py:996] (0/4) Epoch 11, batch 26250, loss[loss=0.2287, simple_loss=0.3106, pruned_loss=0.07334, over 21898.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3075, pruned_loss=0.07069, over 4286456.58 frames. ], batch size: 414, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:46:15,972 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1987236.0, ans=0.04949747468305833 2023-06-28 05:46:21,169 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1987236.0, ans=0.2 2023-06-28 05:47:01,913 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.966e+02 7.302e+02 1.108e+03 1.607e+03 4.168e+03, threshold=2.217e+03, percent-clipped=19.0 2023-06-28 05:47:13,897 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1987416.0, ans=0.125 2023-06-28 05:47:31,777 INFO [train.py:996] (0/4) Epoch 11, batch 26300, loss[loss=0.2158, simple_loss=0.2976, pruned_loss=0.06704, over 21912.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.305, pruned_loss=0.07114, over 4287801.76 frames. ], batch size: 118, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:47:50,713 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1987476.0, ans=0.125 2023-06-28 05:48:26,274 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.04 vs. limit=15.0 2023-06-28 05:48:49,707 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.64 vs. limit=22.5 2023-06-28 05:49:05,619 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1987716.0, ans=0.125 2023-06-28 05:49:16,687 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1987716.0, ans=0.125 2023-06-28 05:49:18,370 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1987776.0, ans=0.125 2023-06-28 05:49:19,387 INFO [train.py:996] (0/4) Epoch 11, batch 26350, loss[loss=0.2507, simple_loss=0.3276, pruned_loss=0.0869, over 21303.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.3028, pruned_loss=0.07171, over 4288221.57 frames. ], batch size: 143, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:50:03,948 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.22 vs. 
limit=15.0 2023-06-28 05:50:32,388 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.942e+02 8.987e+02 1.115e+03 1.521e+03 3.466e+03, threshold=2.231e+03, percent-clipped=6.0 2023-06-28 05:50:44,558 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1988016.0, ans=0.0 2023-06-28 05:51:02,136 INFO [train.py:996] (0/4) Epoch 11, batch 26400, loss[loss=0.2041, simple_loss=0.2628, pruned_loss=0.07269, over 21249.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.297, pruned_loss=0.07158, over 4280886.97 frames. ], batch size: 160, lr: 2.61e-03, grad_scale: 32.0 2023-06-28 05:51:09,819 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1988076.0, ans=0.125 2023-06-28 05:51:16,466 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1988076.0, ans=0.1 2023-06-28 05:51:45,308 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1988196.0, ans=0.0 2023-06-28 05:52:27,602 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.71 vs. limit=22.5 2023-06-28 05:52:30,164 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1988316.0, ans=0.95 2023-06-28 05:52:47,845 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1988376.0, ans=0.0 2023-06-28 05:52:48,908 INFO [train.py:996] (0/4) Epoch 11, batch 26450, loss[loss=0.2039, simple_loss=0.281, pruned_loss=0.06337, over 21364.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2959, pruned_loss=0.07087, over 4280588.97 frames. ], batch size: 211, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:52:55,153 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.59 vs. limit=15.0 2023-06-28 05:52:59,541 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=1988376.0, ans=0.025 2023-06-28 05:53:37,972 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1988496.0, ans=0.125 2023-06-28 05:53:50,575 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.28 vs. limit=22.5 2023-06-28 05:54:06,511 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1988556.0, ans=0.1 2023-06-28 05:54:09,265 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.671e+02 1.024e+03 1.650e+03 2.442e+03 4.564e+03, threshold=3.300e+03, percent-clipped=28.0 2023-06-28 05:54:16,804 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1988616.0, ans=0.1 2023-06-28 05:54:37,910 INFO [train.py:996] (0/4) Epoch 11, batch 26500, loss[loss=0.1699, simple_loss=0.2325, pruned_loss=0.05362, over 21326.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2993, pruned_loss=0.06989, over 4269450.13 frames. 
], batch size: 131, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:56:05,001 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1988916.0, ans=0.125 2023-06-28 05:56:28,875 INFO [train.py:996] (0/4) Epoch 11, batch 26550, loss[loss=0.1866, simple_loss=0.2952, pruned_loss=0.03897, over 21178.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2945, pruned_loss=0.06763, over 4252464.66 frames. ], batch size: 548, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:56:56,430 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1989036.0, ans=0.1 2023-06-28 05:57:08,591 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 05:57:26,420 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1989096.0, ans=0.5 2023-06-28 05:57:29,586 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1989156.0, ans=0.0 2023-06-28 05:57:38,606 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.448e+02 7.811e+02 1.294e+03 2.097e+03 4.356e+03, threshold=2.588e+03, percent-clipped=4.0 2023-06-28 05:57:54,967 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1989216.0, ans=0.125 2023-06-28 05:58:04,754 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1989216.0, ans=0.125 2023-06-28 05:58:10,605 INFO [train.py:996] (0/4) Epoch 11, batch 26600, loss[loss=0.1819, simple_loss=0.2626, pruned_loss=0.05061, over 21439.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2941, pruned_loss=0.06536, over 4247203.39 frames. ], batch size: 212, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:59:01,325 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1989396.0, ans=0.1 2023-06-28 05:59:52,629 INFO [train.py:996] (0/4) Epoch 11, batch 26650, loss[loss=0.1908, simple_loss=0.2752, pruned_loss=0.05315, over 21492.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.288, pruned_loss=0.06441, over 4252633.86 frames. ], batch size: 473, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 06:00:01,552 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1989576.0, ans=0.125 2023-06-28 06:00:47,319 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1989696.0, ans=0.2 2023-06-28 06:00:50,556 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1989756.0, ans=0.0 2023-06-28 06:00:56,046 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.02 vs. limit=12.0 2023-06-28 06:01:05,988 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.789e+02 5.280e+02 6.792e+02 8.539e+02 2.170e+03, threshold=1.358e+03, percent-clipped=0.0 2023-06-28 06:01:33,813 INFO [train.py:996] (0/4) Epoch 11, batch 26700, loss[loss=0.2231, simple_loss=0.2925, pruned_loss=0.07684, over 21879.00 frames. ], tot_loss[loss=0.2016, simple_loss=0.2808, pruned_loss=0.06122, over 4264451.72 frames. 
], batch size: 371, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 06:01:50,768 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1989876.0, ans=0.125 2023-06-28 06:01:59,932 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.22 vs. limit=6.0 2023-06-28 06:02:49,781 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1990056.0, ans=0.0 2023-06-28 06:03:01,953 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1990116.0, ans=0.125 2023-06-28 06:03:01,956 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1990116.0, ans=0.125 2023-06-28 06:03:08,828 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1990116.0, ans=0.1 2023-06-28 06:03:18,028 INFO [train.py:996] (0/4) Epoch 11, batch 26750, loss[loss=0.2636, simple_loss=0.3517, pruned_loss=0.08781, over 21823.00 frames. ], tot_loss[loss=0.2014, simple_loss=0.2813, pruned_loss=0.06071, over 4262454.84 frames. ], batch size: 124, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 06:03:20,801 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.10 vs. limit=15.0 2023-06-28 06:03:25,407 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1990176.0, ans=0.125 2023-06-28 06:04:32,794 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1990356.0, ans=0.125 2023-06-28 06:04:33,823 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.602e+02 7.194e+02 1.094e+03 1.684e+03 4.507e+03, threshold=2.188e+03, percent-clipped=37.0 2023-06-28 06:05:02,072 INFO [train.py:996] (0/4) Epoch 11, batch 26800, loss[loss=0.224, simple_loss=0.3033, pruned_loss=0.07236, over 21765.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2889, pruned_loss=0.06461, over 4265145.01 frames. ], batch size: 332, lr: 2.61e-03, grad_scale: 32.0 2023-06-28 06:05:29,277 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1990536.0, ans=0.0 2023-06-28 06:06:06,885 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 06:06:19,803 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1990656.0, ans=0.125 2023-06-28 06:06:27,657 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1990716.0, ans=0.125 2023-06-28 06:06:28,425 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.63 vs. limit=22.5 2023-06-28 06:06:43,228 INFO [train.py:996] (0/4) Epoch 11, batch 26850, loss[loss=0.2024, simple_loss=0.2696, pruned_loss=0.06756, over 21808.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2901, pruned_loss=0.06628, over 4260972.08 frames. 
], batch size: 352, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 06:06:57,195 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1990776.0, ans=0.125 2023-06-28 06:07:05,347 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1990836.0, ans=0.07 2023-06-28 06:08:02,686 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.773e+02 7.985e+02 1.116e+03 1.630e+03 3.577e+03, threshold=2.232e+03, percent-clipped=14.0 2023-06-28 06:08:24,664 INFO [train.py:996] (0/4) Epoch 11, batch 26900, loss[loss=0.1968, simple_loss=0.2544, pruned_loss=0.06965, over 21352.00 frames. ], tot_loss[loss=0.2063, simple_loss=0.2817, pruned_loss=0.06543, over 4263901.63 frames. ], batch size: 160, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 06:09:16,261 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1991196.0, ans=0.125 2023-06-28 06:09:54,505 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.03 vs. limit=15.0 2023-06-28 06:09:59,838 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.58 vs. limit=15.0 2023-06-28 06:10:05,654 INFO [train.py:996] (0/4) Epoch 11, batch 26950, loss[loss=0.1953, simple_loss=0.281, pruned_loss=0.05484, over 21448.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2824, pruned_loss=0.06615, over 4261561.24 frames. ], batch size: 131, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 06:10:14,629 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1991376.0, ans=0.125 2023-06-28 06:11:11,465 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1991556.0, ans=0.125 2023-06-28 06:11:11,486 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1991556.0, ans=0.125 2023-06-28 06:11:27,132 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.642e+02 6.578e+02 9.395e+02 1.272e+03 2.979e+03, threshold=1.879e+03, percent-clipped=1.0 2023-06-28 06:11:32,996 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1991616.0, ans=0.2 2023-06-28 06:11:47,976 INFO [train.py:996] (0/4) Epoch 11, batch 27000, loss[loss=0.1874, simple_loss=0.2917, pruned_loss=0.04157, over 19799.00 frames. ], tot_loss[loss=0.2053, simple_loss=0.2826, pruned_loss=0.06396, over 4258617.67 frames. ], batch size: 702, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 06:11:47,977 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-28 06:12:09,373 INFO [train.py:1028] (0/4) Epoch 11, validation: loss=0.246, simple_loss=0.3377, pruned_loss=0.07718, over 1796401.00 frames. 
2023-06-28 06:12:09,374 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23714MB 2023-06-28 06:12:20,423 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1991676.0, ans=0.125 2023-06-28 06:12:34,250 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1991736.0, ans=0.1 2023-06-28 06:12:40,774 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1991736.0, ans=0.0 2023-06-28 06:13:57,804 INFO [train.py:996] (0/4) Epoch 11, batch 27050, loss[loss=0.197, simple_loss=0.283, pruned_loss=0.0555, over 21590.00 frames. ], tot_loss[loss=0.2046, simple_loss=0.2863, pruned_loss=0.06144, over 4261881.23 frames. ], batch size: 263, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 06:14:02,958 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-332000.pt 2023-06-28 06:14:42,433 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.78 vs. limit=15.0 2023-06-28 06:15:07,777 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.395e+02 5.810e+02 8.142e+02 1.096e+03 2.681e+03, threshold=1.628e+03, percent-clipped=6.0 2023-06-28 06:15:08,559 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1992156.0, ans=0.125 2023-06-28 06:15:37,382 INFO [train.py:996] (0/4) Epoch 11, batch 27100, loss[loss=0.1889, simple_loss=0.2604, pruned_loss=0.0587, over 21205.00 frames. ], tot_loss[loss=0.2063, simple_loss=0.2871, pruned_loss=0.06274, over 4271258.40 frames. ], batch size: 607, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 06:16:04,834 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1992336.0, ans=0.0 2023-06-28 06:16:24,710 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.42 vs. limit=22.5 2023-06-28 06:16:41,620 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1992456.0, ans=0.1 2023-06-28 06:17:01,734 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1992516.0, ans=0.1 2023-06-28 06:17:22,565 INFO [train.py:996] (0/4) Epoch 11, batch 27150, loss[loss=0.2302, simple_loss=0.3432, pruned_loss=0.05857, over 19877.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2997, pruned_loss=0.06642, over 4272479.35 frames. ], batch size: 702, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 06:17:51,216 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1992636.0, ans=0.0 2023-06-28 06:17:52,722 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1992636.0, ans=0.2 2023-06-28 06:17:58,257 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.25 vs. 
limit=12.0 2023-06-28 06:18:35,181 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.727e+02 8.827e+02 1.451e+03 2.136e+03 4.044e+03, threshold=2.902e+03, percent-clipped=43.0 2023-06-28 06:18:55,599 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 06:18:59,850 INFO [train.py:996] (0/4) Epoch 11, batch 27200, loss[loss=0.213, simple_loss=0.2892, pruned_loss=0.06844, over 21623.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.3072, pruned_loss=0.06886, over 4272505.53 frames. ], batch size: 112, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 06:19:26,161 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1992936.0, ans=0.125 2023-06-28 06:19:36,204 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1992936.0, ans=0.125 2023-06-28 06:20:01,212 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1992996.0, ans=0.125 2023-06-28 06:20:15,907 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1993056.0, ans=0.125 2023-06-28 06:20:29,823 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1993116.0, ans=0.0 2023-06-28 06:20:49,584 INFO [train.py:996] (0/4) Epoch 11, batch 27250, loss[loss=0.2866, simple_loss=0.3455, pruned_loss=0.1139, over 21383.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3088, pruned_loss=0.07231, over 4265505.59 frames. ], batch size: 471, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 06:20:53,578 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1993176.0, ans=0.125 2023-06-28 06:21:02,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1993176.0, ans=0.125 2023-06-28 06:21:08,665 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.90 vs. limit=15.0 2023-06-28 06:21:55,149 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1993356.0, ans=0.1 2023-06-28 06:21:55,194 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1993356.0, ans=0.2 2023-06-28 06:22:14,917 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.213e+02 7.322e+02 9.525e+02 1.331e+03 3.028e+03, threshold=1.905e+03, percent-clipped=1.0 2023-06-28 06:22:35,622 INFO [train.py:996] (0/4) Epoch 11, batch 27300, loss[loss=0.2416, simple_loss=0.3312, pruned_loss=0.07601, over 21309.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3106, pruned_loss=0.07327, over 4270615.60 frames. ], batch size: 549, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 06:22:46,961 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.43 vs. 
limit=15.0 2023-06-28 06:23:06,795 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1993536.0, ans=0.125 2023-06-28 06:23:20,101 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1993536.0, ans=0.0 2023-06-28 06:24:20,015 INFO [train.py:996] (0/4) Epoch 11, batch 27350, loss[loss=0.2164, simple_loss=0.301, pruned_loss=0.0659, over 21476.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3121, pruned_loss=0.0743, over 4271595.42 frames. ], batch size: 194, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 06:24:24,407 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.01 vs. limit=15.0 2023-06-28 06:24:47,082 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.53 vs. limit=15.0 2023-06-28 06:25:17,322 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1993896.0, ans=0.1 2023-06-28 06:25:20,560 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 06:25:20,652 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1993896.0, ans=0.125 2023-06-28 06:25:41,652 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.836e+02 8.634e+02 1.277e+03 1.695e+03 3.535e+03, threshold=2.554e+03, percent-clipped=18.0 2023-06-28 06:25:47,183 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1994016.0, ans=0.125 2023-06-28 06:26:01,625 INFO [train.py:996] (0/4) Epoch 11, batch 27400, loss[loss=0.1978, simple_loss=0.2639, pruned_loss=0.06587, over 21512.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3066, pruned_loss=0.0731, over 4280118.32 frames. ], batch size: 548, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 06:26:46,433 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.44 vs. limit=15.0 2023-06-28 06:27:29,798 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1994316.0, ans=0.0 2023-06-28 06:27:37,987 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1994316.0, ans=0.125 2023-06-28 06:27:44,105 INFO [train.py:996] (0/4) Epoch 11, batch 27450, loss[loss=0.2132, simple_loss=0.302, pruned_loss=0.06222, over 21749.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.3014, pruned_loss=0.07166, over 4279384.14 frames. 
], batch size: 351, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 06:28:52,506 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1994556.0, ans=0.0 2023-06-28 06:29:01,004 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1994556.0, ans=0.125 2023-06-28 06:29:05,151 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.109e+02 6.770e+02 1.005e+03 1.545e+03 3.220e+03, threshold=2.009e+03, percent-clipped=5.0 2023-06-28 06:29:25,916 INFO [train.py:996] (0/4) Epoch 11, batch 27500, loss[loss=0.2125, simple_loss=0.2807, pruned_loss=0.07215, over 21562.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.301, pruned_loss=0.0722, over 4289536.76 frames. ], batch size: 548, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 06:30:11,729 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1994796.0, ans=0.125 2023-06-28 06:31:16,353 INFO [train.py:996] (0/4) Epoch 11, batch 27550, loss[loss=0.1949, simple_loss=0.2625, pruned_loss=0.06366, over 21252.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2947, pruned_loss=0.06878, over 4293643.71 frames. ], batch size: 548, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 06:31:36,451 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1995036.0, ans=0.125 2023-06-28 06:32:27,433 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=1995156.0, ans=0.5 2023-06-28 06:32:28,479 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.464e+02 6.626e+02 9.640e+02 1.426e+03 2.852e+03, threshold=1.928e+03, percent-clipped=10.0 2023-06-28 06:32:52,121 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1995276.0, ans=0.2 2023-06-28 06:32:53,151 INFO [train.py:996] (0/4) Epoch 11, batch 27600, loss[loss=0.1794, simple_loss=0.2439, pruned_loss=0.05743, over 21574.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2878, pruned_loss=0.06744, over 4284833.57 frames. ], batch size: 247, lr: 2.60e-03, grad_scale: 32.0 2023-06-28 06:32:53,853 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1995276.0, ans=0.1 2023-06-28 06:33:06,581 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1995276.0, ans=0.0 2023-06-28 06:33:33,085 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1995336.0, ans=0.125 2023-06-28 06:34:30,798 INFO [train.py:996] (0/4) Epoch 11, batch 27650, loss[loss=0.2284, simple_loss=0.3072, pruned_loss=0.07477, over 19982.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.2824, pruned_loss=0.06669, over 4280206.64 frames. 
], batch size: 702, lr: 2.60e-03, grad_scale: 32.0 2023-06-28 06:35:23,516 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1995696.0, ans=0.0 2023-06-28 06:35:54,453 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.549e+02 8.580e+02 1.335e+03 1.838e+03 2.881e+03, threshold=2.670e+03, percent-clipped=20.0 2023-06-28 06:36:17,807 INFO [train.py:996] (0/4) Epoch 11, batch 27700, loss[loss=0.2363, simple_loss=0.3254, pruned_loss=0.07357, over 21665.00 frames. ], tot_loss[loss=0.2062, simple_loss=0.2824, pruned_loss=0.065, over 4280610.86 frames. ], batch size: 389, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 06:36:45,810 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 06:37:12,475 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1995996.0, ans=0.0 2023-06-28 06:37:49,819 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.49 vs. limit=15.0 2023-06-28 06:38:04,197 INFO [train.py:996] (0/4) Epoch 11, batch 27750, loss[loss=0.1796, simple_loss=0.2619, pruned_loss=0.04861, over 21319.00 frames. ], tot_loss[loss=0.2085, simple_loss=0.2872, pruned_loss=0.06487, over 4275990.15 frames. ], batch size: 159, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 06:38:40,100 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1996296.0, ans=0.1 2023-06-28 06:38:56,605 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.46 vs. limit=10.0 2023-06-28 06:39:12,873 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1996356.0, ans=0.0 2023-06-28 06:39:18,759 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.799e+02 7.463e+02 1.002e+03 1.388e+03 2.774e+03, threshold=2.003e+03, percent-clipped=1.0 2023-06-28 06:39:27,552 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1996416.0, ans=0.0 2023-06-28 06:39:39,535 INFO [train.py:996] (0/4) Epoch 11, batch 27800, loss[loss=0.2182, simple_loss=0.2898, pruned_loss=0.07328, over 21727.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2858, pruned_loss=0.06521, over 4284029.33 frames. ], batch size: 389, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 06:40:04,525 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.75 vs. limit=15.0 2023-06-28 06:40:33,273 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1996596.0, ans=0.125 2023-06-28 06:40:33,367 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1996596.0, ans=0.2 2023-06-28 06:41:26,059 INFO [train.py:996] (0/4) Epoch 11, batch 27850, loss[loss=0.2228, simple_loss=0.3131, pruned_loss=0.06625, over 21858.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.285, pruned_loss=0.06619, over 4294083.76 frames. 
], batch size: 332, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 06:41:30,106 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1996776.0, ans=0.0 2023-06-28 06:42:49,901 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.063e+02 7.124e+02 9.695e+02 1.441e+03 2.660e+03, threshold=1.939e+03, percent-clipped=8.0 2023-06-28 06:43:16,100 INFO [train.py:996] (0/4) Epoch 11, batch 27900, loss[loss=0.19, simple_loss=0.2839, pruned_loss=0.04804, over 21394.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2934, pruned_loss=0.0671, over 4296117.77 frames. ], batch size: 211, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 06:43:25,406 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1997076.0, ans=0.04949747468305833 2023-06-28 06:43:48,913 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1997136.0, ans=0.125 2023-06-28 06:43:49,111 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1997136.0, ans=0.0 2023-06-28 06:43:57,538 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1997196.0, ans=0.0 2023-06-28 06:44:35,485 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 06:44:57,013 INFO [train.py:996] (0/4) Epoch 11, batch 27950, loss[loss=0.2538, simple_loss=0.3695, pruned_loss=0.06905, over 19899.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2936, pruned_loss=0.0641, over 4281477.20 frames. ], batch size: 703, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 06:45:46,356 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1997496.0, ans=0.125 2023-06-28 06:46:20,482 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1997556.0, ans=0.0 2023-06-28 06:46:23,001 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.171e+02 6.116e+02 8.584e+02 1.262e+03 3.314e+03, threshold=1.717e+03, percent-clipped=6.0 2023-06-28 06:46:31,047 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.52 vs. limit=10.0 2023-06-28 06:46:39,423 INFO [train.py:996] (0/4) Epoch 11, batch 28000, loss[loss=0.2064, simple_loss=0.2762, pruned_loss=0.06827, over 21813.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2907, pruned_loss=0.06198, over 4281262.89 frames. ], batch size: 247, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 06:47:17,106 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1997796.0, ans=0.0 2023-06-28 06:47:18,396 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1997796.0, ans=0.125 2023-06-28 06:47:55,352 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1997856.0, ans=0.0 2023-06-28 06:48:09,023 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1997916.0, ans=0.125 2023-06-28 06:48:23,190 INFO [train.py:996] (0/4) Epoch 11, batch 28050, loss[loss=0.1933, simple_loss=0.2759, pruned_loss=0.05532, over 21797.00 frames. 
], tot_loss[loss=0.2078, simple_loss=0.2884, pruned_loss=0.06361, over 4283485.70 frames. ], batch size: 332, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 06:48:52,655 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1998036.0, ans=0.2 2023-06-28 06:49:38,099 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1998156.0, ans=0.125 2023-06-28 06:49:50,472 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.773e+02 7.059e+02 1.070e+03 1.534e+03 3.837e+03, threshold=2.141e+03, percent-clipped=19.0 2023-06-28 06:49:57,779 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1998216.0, ans=0.0 2023-06-28 06:49:57,805 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1998216.0, ans=0.0 2023-06-28 06:50:05,481 INFO [train.py:996] (0/4) Epoch 11, batch 28100, loss[loss=0.1911, simple_loss=0.2636, pruned_loss=0.05923, over 21735.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.2865, pruned_loss=0.06358, over 4279244.94 frames. ], batch size: 371, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 06:50:40,789 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1998396.0, ans=0.0 2023-06-28 06:51:11,641 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1998456.0, ans=0.2 2023-06-28 06:51:27,967 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1998516.0, ans=0.125 2023-06-28 06:51:29,533 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1998516.0, ans=0.125 2023-06-28 06:51:39,428 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1998516.0, ans=0.0 2023-06-28 06:51:42,263 INFO [train.py:996] (0/4) Epoch 11, batch 28150, loss[loss=0.2073, simple_loss=0.265, pruned_loss=0.07485, over 21226.00 frames. ], tot_loss[loss=0.2039, simple_loss=0.2801, pruned_loss=0.0638, over 4263258.50 frames. ], batch size: 471, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 06:53:04,713 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.152e+02 7.382e+02 1.011e+03 1.548e+03 3.347e+03, threshold=2.022e+03, percent-clipped=11.0 2023-06-28 06:53:13,803 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1998816.0, ans=0.0 2023-06-28 06:53:19,839 INFO [train.py:996] (0/4) Epoch 11, batch 28200, loss[loss=0.2635, simple_loss=0.3978, pruned_loss=0.06458, over 19889.00 frames. ], tot_loss[loss=0.2045, simple_loss=0.2794, pruned_loss=0.06481, over 4265998.18 frames. ], batch size: 702, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 06:53:32,482 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1998876.0, ans=0.0 2023-06-28 06:53:33,133 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.60 vs. 
limit=15.0 2023-06-28 06:53:36,174 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1998936.0, ans=0.1 2023-06-28 06:54:13,251 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.36 vs. limit=15.0 2023-06-28 06:54:53,648 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1999116.0, ans=0.125 2023-06-28 06:54:58,084 INFO [train.py:996] (0/4) Epoch 11, batch 28250, loss[loss=0.2144, simple_loss=0.2805, pruned_loss=0.07413, over 21159.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.2827, pruned_loss=0.06703, over 4269006.45 frames. ], batch size: 143, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 06:55:00,417 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1999176.0, ans=0.125 2023-06-28 06:56:21,466 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.911e+02 7.590e+02 1.013e+03 1.851e+03 3.926e+03, threshold=2.026e+03, percent-clipped=15.0 2023-06-28 06:56:37,218 INFO [train.py:996] (0/4) Epoch 11, batch 28300, loss[loss=0.1913, simple_loss=0.2911, pruned_loss=0.04574, over 21624.00 frames. ], tot_loss[loss=0.2056, simple_loss=0.2808, pruned_loss=0.06523, over 4275515.83 frames. ], batch size: 389, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 06:56:46,428 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff2.min_abs, batch_count=1999476.0, ans=0.1 2023-06-28 06:57:21,232 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.08 vs. limit=6.0 2023-06-28 06:58:15,200 INFO [train.py:996] (0/4) Epoch 11, batch 28350, loss[loss=0.1584, simple_loss=0.2449, pruned_loss=0.03595, over 21366.00 frames. ], tot_loss[loss=0.2005, simple_loss=0.2788, pruned_loss=0.06112, over 4271074.71 frames. ], batch size: 211, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 06:59:07,158 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.71 vs. limit=22.5 2023-06-28 06:59:15,011 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1999896.0, ans=0.125 2023-06-28 06:59:20,184 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1999956.0, ans=0.125 2023-06-28 06:59:25,682 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.01 vs. 
limit=15.0 2023-06-28 06:59:28,731 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1999956.0, ans=0.125 2023-06-28 06:59:30,580 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1999956.0, ans=0.2 2023-06-28 06:59:37,856 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.816e+02 8.082e+02 1.144e+03 1.595e+03 4.896e+03, threshold=2.288e+03, percent-clipped=16.0 2023-06-28 06:59:38,460 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2000016.0, ans=0.125 2023-06-28 06:59:55,517 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=2000016.0, ans=15.0 2023-06-28 06:59:57,609 INFO [train.py:996] (0/4) Epoch 11, batch 28400, loss[loss=0.2032, simple_loss=0.2703, pruned_loss=0.06804, over 21798.00 frames. ], tot_loss[loss=0.1986, simple_loss=0.2759, pruned_loss=0.06064, over 4255294.74 frames. ], batch size: 372, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:00:46,325 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=2000196.0, ans=0.0 2023-06-28 07:00:53,081 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2000196.0, ans=0.125 2023-06-28 07:00:56,387 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2000196.0, ans=0.0 2023-06-28 07:01:24,918 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2000316.0, ans=0.125 2023-06-28 07:01:40,570 INFO [train.py:996] (0/4) Epoch 11, batch 28450, loss[loss=0.238, simple_loss=0.3093, pruned_loss=0.08335, over 21740.00 frames. ], tot_loss[loss=0.2056, simple_loss=0.2821, pruned_loss=0.06451, over 4263249.59 frames. ], batch size: 414, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:01:53,485 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.60 vs. limit=22.5 2023-06-28 07:02:22,022 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=2000436.0, ans=0.2 2023-06-28 07:02:38,615 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2000496.0, ans=0.125 2023-06-28 07:03:03,715 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.045e+02 8.706e+02 1.350e+03 2.003e+03 3.584e+03, threshold=2.700e+03, percent-clipped=15.0 2023-06-28 07:03:28,164 INFO [train.py:996] (0/4) Epoch 11, batch 28500, loss[loss=0.2475, simple_loss=0.3189, pruned_loss=0.08807, over 21682.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2845, pruned_loss=0.06665, over 4272886.41 frames. 
], batch size: 415, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:04:00,664 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 07:04:04,051 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=2000736.0, ans=0.025 2023-06-28 07:04:11,349 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.02 vs. limit=15.0 2023-06-28 07:04:19,276 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2000796.0, ans=0.0 2023-06-28 07:05:09,576 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.57 vs. limit=15.0 2023-06-28 07:05:11,649 INFO [train.py:996] (0/4) Epoch 11, batch 28550, loss[loss=0.3315, simple_loss=0.4156, pruned_loss=0.1237, over 21527.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2912, pruned_loss=0.06849, over 4276730.36 frames. ], batch size: 471, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:05:14,630 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.21 vs. limit=6.0 2023-06-28 07:05:42,122 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2001036.0, ans=0.125 2023-06-28 07:05:57,368 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2001096.0, ans=0.125 2023-06-28 07:06:41,152 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.280e+02 7.153e+02 1.164e+03 1.649e+03 3.101e+03, threshold=2.329e+03, percent-clipped=2.0 2023-06-28 07:06:59,351 INFO [train.py:996] (0/4) Epoch 11, batch 28600, loss[loss=0.2632, simple_loss=0.3368, pruned_loss=0.0948, over 21557.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2982, pruned_loss=0.07049, over 4276610.00 frames. ], batch size: 414, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 07:07:51,685 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2001396.0, ans=0.04949747468305833 2023-06-28 07:08:10,396 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2001456.0, ans=0.0 2023-06-28 07:08:16,204 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.68 vs. limit=15.0 2023-06-28 07:08:41,521 INFO [train.py:996] (0/4) Epoch 11, batch 28650, loss[loss=0.183, simple_loss=0.2471, pruned_loss=0.05946, over 21135.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2926, pruned_loss=0.07002, over 4277781.06 frames. ], batch size: 176, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 07:08:46,084 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.34 vs. 
limit=12.0 2023-06-28 07:08:47,567 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2001576.0, ans=0.125 2023-06-28 07:09:42,509 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2001756.0, ans=0.125 2023-06-28 07:09:51,071 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2001756.0, ans=0.1 2023-06-28 07:09:54,249 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2001756.0, ans=0.0 2023-06-28 07:10:06,741 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.860e+02 6.883e+02 1.055e+03 1.706e+03 3.634e+03, threshold=2.110e+03, percent-clipped=9.0 2023-06-28 07:10:19,940 INFO [train.py:996] (0/4) Epoch 11, batch 28700, loss[loss=0.2005, simple_loss=0.2753, pruned_loss=0.06283, over 20661.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2907, pruned_loss=0.07045, over 4267791.31 frames. ], batch size: 607, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 07:10:25,337 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=2001876.0, ans=0.125 2023-06-28 07:12:03,056 INFO [train.py:996] (0/4) Epoch 11, batch 28750, loss[loss=0.2344, simple_loss=0.317, pruned_loss=0.0759, over 21725.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2911, pruned_loss=0.07078, over 4268688.21 frames. ], batch size: 441, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 07:12:41,079 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.23 vs. limit=15.0 2023-06-28 07:13:01,270 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=2002296.0, ans=0.125 2023-06-28 07:13:33,315 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.047e+02 7.903e+02 1.280e+03 1.957e+03 3.313e+03, threshold=2.559e+03, percent-clipped=20.0 2023-06-28 07:13:46,641 INFO [train.py:996] (0/4) Epoch 11, batch 28800, loss[loss=0.2324, simple_loss=0.3071, pruned_loss=0.07888, over 21757.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2954, pruned_loss=0.07113, over 4273297.46 frames. ], batch size: 298, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:13:57,786 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2002476.0, ans=0.0 2023-06-28 07:14:01,603 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.38 vs. limit=15.0 2023-06-28 07:14:03,217 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=2002536.0, ans=6.0 2023-06-28 07:14:57,456 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2002656.0, ans=0.125 2023-06-28 07:15:07,422 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=2002656.0, ans=0.0 2023-06-28 07:15:28,374 INFO [train.py:996] (0/4) Epoch 11, batch 28850, loss[loss=0.2261, simple_loss=0.2926, pruned_loss=0.0798, over 21488.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.2966, pruned_loss=0.07301, over 4282658.13 frames. 
], batch size: 144, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:15:30,524 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=2002776.0, ans=0.125 2023-06-28 07:16:18,447 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=2002896.0, ans=0.5 2023-06-28 07:16:54,687 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2003016.0, ans=0.125 2023-06-28 07:16:58,888 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.940e+02 7.188e+02 1.079e+03 1.563e+03 3.306e+03, threshold=2.159e+03, percent-clipped=4.0 2023-06-28 07:17:12,880 INFO [train.py:996] (0/4) Epoch 11, batch 28900, loss[loss=0.2187, simple_loss=0.2947, pruned_loss=0.07138, over 21691.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.2994, pruned_loss=0.07411, over 4282340.82 frames. ], batch size: 351, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:17:23,206 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=2003076.0, ans=0.07 2023-06-28 07:18:11,350 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=2003196.0, ans=0.0 2023-06-28 07:18:18,716 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.79 vs. limit=6.0 2023-06-28 07:19:06,041 INFO [train.py:996] (0/4) Epoch 11, batch 28950, loss[loss=0.2906, simple_loss=0.3774, pruned_loss=0.1019, over 21480.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.3016, pruned_loss=0.07369, over 4277265.90 frames. ], batch size: 507, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:19:50,739 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=2003496.0, ans=0.05 2023-06-28 07:20:36,315 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.876e+02 7.501e+02 1.038e+03 1.526e+03 3.753e+03, threshold=2.076e+03, percent-clipped=10.0 2023-06-28 07:20:45,522 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2003616.0, ans=0.125 2023-06-28 07:20:54,746 INFO [train.py:996] (0/4) Epoch 11, batch 29000, loss[loss=0.2183, simple_loss=0.3089, pruned_loss=0.06383, over 21432.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.3032, pruned_loss=0.07253, over 4276210.68 frames. 
], batch size: 131, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:20:55,378 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=2003676.0, ans=0.0 2023-06-28 07:21:00,780 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff3.min_abs, batch_count=2003676.0, ans=0.2 2023-06-28 07:21:06,287 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=2003676.0, ans=15.0 2023-06-28 07:21:32,000 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2003796.0, ans=0.125 2023-06-28 07:22:31,554 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2003916.0, ans=0.1 2023-06-28 07:22:35,930 INFO [train.py:996] (0/4) Epoch 11, batch 29050, loss[loss=0.2143, simple_loss=0.2754, pruned_loss=0.07664, over 21515.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.3017, pruned_loss=0.07347, over 4278397.72 frames. ], batch size: 548, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:23:04,324 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=2004036.0, ans=0.125 2023-06-28 07:23:16,139 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=2004096.0, ans=0.0 2023-06-28 07:24:04,795 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.241e+02 7.722e+02 1.077e+03 1.560e+03 2.970e+03, threshold=2.155e+03, percent-clipped=7.0 2023-06-28 07:24:18,299 INFO [train.py:996] (0/4) Epoch 11, batch 29100, loss[loss=0.1694, simple_loss=0.2403, pruned_loss=0.04923, over 21667.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2942, pruned_loss=0.07103, over 4270099.10 frames. ], batch size: 282, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:24:55,082 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2004396.0, ans=0.125 2023-06-28 07:25:08,409 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2004396.0, ans=0.125 2023-06-28 07:25:29,264 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.66 vs. limit=15.0 2023-06-28 07:25:41,069 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.52 vs. limit=12.0 2023-06-28 07:25:50,230 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2004516.0, ans=0.125 2023-06-28 07:25:59,495 INFO [train.py:996] (0/4) Epoch 11, batch 29150, loss[loss=0.2136, simple_loss=0.2932, pruned_loss=0.06699, over 21678.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2931, pruned_loss=0.06972, over 4266072.66 frames. ], batch size: 247, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:26:15,207 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.91 vs. limit=15.0 2023-06-28 07:26:28,283 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.00 vs. 
limit=15.0 2023-06-28 07:27:16,519 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.34 vs. limit=6.0 2023-06-28 07:27:19,320 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=2004756.0, ans=10.0 2023-06-28 07:27:26,707 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.585e+02 7.263e+02 1.061e+03 1.748e+03 3.304e+03, threshold=2.122e+03, percent-clipped=12.0 2023-06-28 07:27:27,527 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=2004816.0, ans=0.125 2023-06-28 07:27:29,105 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2004816.0, ans=0.0 2023-06-28 07:27:39,640 INFO [train.py:996] (0/4) Epoch 11, batch 29200, loss[loss=0.2061, simple_loss=0.2693, pruned_loss=0.07141, over 21179.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.289, pruned_loss=0.06909, over 4268677.10 frames. ], batch size: 176, lr: 2.60e-03, grad_scale: 32.0 2023-06-28 07:28:10,864 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2004936.0, ans=0.1 2023-06-28 07:28:16,147 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2004996.0, ans=0.0 2023-06-28 07:29:25,807 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.62 vs. limit=15.0 2023-06-28 07:29:26,299 INFO [train.py:996] (0/4) Epoch 11, batch 29250, loss[loss=0.2075, simple_loss=0.2874, pruned_loss=0.06382, over 21096.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2872, pruned_loss=0.0666, over 4271029.44 frames. ], batch size: 143, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:29:32,533 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2005176.0, ans=0.125 2023-06-28 07:30:44,427 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2005356.0, ans=0.125 2023-06-28 07:30:52,283 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.090e+02 7.920e+02 1.206e+03 1.772e+03 3.423e+03, threshold=2.413e+03, percent-clipped=14.0 2023-06-28 07:31:05,480 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2005416.0, ans=0.1 2023-06-28 07:31:08,126 INFO [train.py:996] (0/4) Epoch 11, batch 29300, loss[loss=0.22, simple_loss=0.2869, pruned_loss=0.0765, over 21463.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2897, pruned_loss=0.0662, over 4272846.42 frames. ], batch size: 389, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:31:17,027 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2005476.0, ans=0.125 2023-06-28 07:31:27,164 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.57 vs. 
limit=15.0 2023-06-28 07:32:25,359 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=2005716.0, ans=0.035 2023-06-28 07:32:35,515 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.61 vs. limit=15.0 2023-06-28 07:32:46,277 INFO [train.py:996] (0/4) Epoch 11, batch 29350, loss[loss=0.2052, simple_loss=0.287, pruned_loss=0.06176, over 21549.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.285, pruned_loss=0.06508, over 4268807.73 frames. ], batch size: 230, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:32:51,871 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=2005776.0, ans=0.05 2023-06-28 07:32:52,586 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.27 vs. limit=15.0 2023-06-28 07:32:53,870 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 07:33:24,617 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.07 vs. limit=15.0 2023-06-28 07:34:10,869 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.37 vs. limit=15.0 2023-06-28 07:34:18,450 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.524e+02 6.284e+02 9.416e+02 1.465e+03 2.688e+03, threshold=1.883e+03, percent-clipped=1.0 2023-06-28 07:34:21,575 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.13 vs. limit=22.5 2023-06-28 07:34:23,101 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.93 vs. limit=6.0 2023-06-28 07:34:29,999 INFO [train.py:996] (0/4) Epoch 11, batch 29400, loss[loss=0.1671, simple_loss=0.2434, pruned_loss=0.0454, over 21780.00 frames. ], tot_loss[loss=0.2058, simple_loss=0.2851, pruned_loss=0.06326, over 4264704.53 frames. ], batch size: 282, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:35:40,552 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.50 vs. limit=22.5 2023-06-28 07:35:43,402 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=2006256.0, ans=0.0 2023-06-28 07:35:45,025 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=2006256.0, ans=0.125 2023-06-28 07:35:45,093 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2006256.0, ans=0.1 2023-06-28 07:36:12,405 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2006376.0, ans=0.1 2023-06-28 07:36:13,431 INFO [train.py:996] (0/4) Epoch 11, batch 29450, loss[loss=0.2347, simple_loss=0.3161, pruned_loss=0.07667, over 21413.00 frames. ], tot_loss[loss=0.2044, simple_loss=0.2834, pruned_loss=0.06272, over 4265445.84 frames. 
], batch size: 131, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:37:29,767 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 07:37:43,663 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.711e+02 7.454e+02 1.204e+03 1.829e+03 3.653e+03, threshold=2.407e+03, percent-clipped=22.0 2023-06-28 07:37:44,340 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2006616.0, ans=0.0 2023-06-28 07:37:54,985 INFO [train.py:996] (0/4) Epoch 11, batch 29500, loss[loss=0.2216, simple_loss=0.2891, pruned_loss=0.07704, over 21845.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2873, pruned_loss=0.06603, over 4272217.35 frames. ], batch size: 441, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:38:00,501 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=2006676.0, ans=0.015 2023-06-28 07:38:00,606 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=2006676.0, ans=0.2 2023-06-28 07:38:13,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=2006736.0, ans=0.05 2023-06-28 07:38:40,982 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=2006796.0, ans=0.125 2023-06-28 07:39:01,876 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2006856.0, ans=0.125 2023-06-28 07:39:05,343 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=2006856.0, ans=0.125 2023-06-28 07:39:23,653 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2006916.0, ans=0.0 2023-06-28 07:39:36,797 INFO [train.py:996] (0/4) Epoch 11, batch 29550, loss[loss=0.193, simple_loss=0.267, pruned_loss=0.05947, over 21662.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2865, pruned_loss=0.06729, over 4283412.08 frames. ], batch size: 263, lr: 2.59e-03, grad_scale: 16.0 2023-06-28 07:39:42,415 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=2006976.0, ans=0.125 2023-06-28 07:39:46,451 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.06 vs. limit=15.0 2023-06-28 07:41:00,967 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=2007156.0, ans=0.0 2023-06-28 07:41:08,576 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.953e+02 8.227e+02 1.182e+03 1.842e+03 6.634e+03, threshold=2.364e+03, percent-clipped=14.0 2023-06-28 07:41:19,883 INFO [train.py:996] (0/4) Epoch 11, batch 29600, loss[loss=0.2459, simple_loss=0.3361, pruned_loss=0.07787, over 21636.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2932, pruned_loss=0.06964, over 4286064.12 frames. ], batch size: 263, lr: 2.59e-03, grad_scale: 32.0 2023-06-28 07:41:25,895 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2007276.0, ans=0.125 2023-06-28 07:41:56,014 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=3.73 vs. 
limit=15.0 2023-06-28 07:42:26,104 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2007456.0, ans=0.1 2023-06-28 07:42:57,471 INFO [train.py:996] (0/4) Epoch 11, batch 29650, loss[loss=0.2055, simple_loss=0.2772, pruned_loss=0.06689, over 21456.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2898, pruned_loss=0.06611, over 4282351.36 frames. ], batch size: 131, lr: 2.59e-03, grad_scale: 16.0 2023-06-28 07:43:17,748 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=2007636.0, ans=0.0 2023-06-28 07:43:19,479 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2007636.0, ans=0.125 2023-06-28 07:44:26,229 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.283e+02 7.090e+02 1.074e+03 1.668e+03 4.986e+03, threshold=2.147e+03, percent-clipped=16.0 2023-06-28 07:44:32,190 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2007816.0, ans=0.125 2023-06-28 07:44:40,983 INFO [train.py:996] (0/4) Epoch 11, batch 29700, loss[loss=0.1917, simple_loss=0.3093, pruned_loss=0.03707, over 19815.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2898, pruned_loss=0.06605, over 4271561.59 frames. ], batch size: 702, lr: 2.59e-03, grad_scale: 16.0 2023-06-28 07:45:42,251 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=2007996.0, ans=0.125 2023-06-28 07:45:47,405 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2008056.0, ans=0.0 2023-06-28 07:46:22,823 INFO [train.py:996] (0/4) Epoch 11, batch 29750, loss[loss=0.2015, simple_loss=0.2798, pruned_loss=0.06158, over 21226.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2966, pruned_loss=0.06625, over 4280543.01 frames. ], batch size: 143, lr: 2.59e-03, grad_scale: 16.0 2023-06-28 07:46:39,366 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2008176.0, ans=0.125 2023-06-28 07:47:31,271 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=2008356.0, ans=0.0 2023-06-28 07:47:40,456 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.85 vs. limit=22.5 2023-06-28 07:47:49,324 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.946e+02 7.116e+02 1.082e+03 1.518e+03 2.580e+03, threshold=2.164e+03, percent-clipped=5.0 2023-06-28 07:48:01,002 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2008416.0, ans=0.1 2023-06-28 07:48:07,946 INFO [train.py:996] (0/4) Epoch 11, batch 29800, loss[loss=0.2157, simple_loss=0.2881, pruned_loss=0.07167, over 21328.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2976, pruned_loss=0.06644, over 4279079.07 frames. 
], batch size: 144, lr: 2.59e-03, grad_scale: 16.0 2023-06-28 07:48:26,563 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2008476.0, ans=0.0 2023-06-28 07:48:43,930 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=2008536.0, ans=0.125 2023-06-28 07:49:00,821 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2008596.0, ans=0.1 2023-06-28 07:49:02,495 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=2008596.0, ans=0.035 2023-06-28 07:49:28,527 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2008716.0, ans=0.0 2023-06-28 07:49:43,441 INFO [train.py:996] (0/4) Epoch 11, batch 29850, loss[loss=0.2175, simple_loss=0.2971, pruned_loss=0.06894, over 21542.00 frames. ], tot_loss[loss=0.212, simple_loss=0.2938, pruned_loss=0.06511, over 4283141.18 frames. ], batch size: 131, lr: 2.59e-03, grad_scale: 16.0 2023-06-28 07:50:22,362 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=2008836.0, ans=0.125 2023-06-28 07:51:09,754 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.656e+02 6.362e+02 8.665e+02 1.424e+03 2.891e+03, threshold=1.733e+03, percent-clipped=5.0 2023-06-28 07:51:29,357 INFO [train.py:996] (0/4) Epoch 11, batch 29900, loss[loss=0.2314, simple_loss=0.347, pruned_loss=0.05791, over 19805.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2916, pruned_loss=0.06584, over 4285618.68 frames. ], batch size: 703, lr: 2.59e-03, grad_scale: 16.0 2023-06-28 07:51:33,134 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2009076.0, ans=0.125 2023-06-28 07:51:55,977 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=2009136.0, ans=0.0 2023-06-28 07:52:57,315 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2009316.0, ans=0.125 2023-06-28 07:52:57,839 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=5.59 vs. limit=22.5 2023-06-28 07:53:11,897 INFO [train.py:996] (0/4) Epoch 11, batch 29950, loss[loss=0.2581, simple_loss=0.3282, pruned_loss=0.09394, over 21347.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2949, pruned_loss=0.06882, over 4277830.12 frames. ], batch size: 176, lr: 2.59e-03, grad_scale: 16.0 2023-06-28 07:53:19,468 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.47 vs. 
limit=22.5 2023-06-28 07:53:20,714 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2009376.0, ans=0.125 2023-06-28 07:54:11,419 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=2009556.0, ans=0.125 2023-06-28 07:54:21,448 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2009556.0, ans=0.125 2023-06-28 07:54:41,417 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.94 vs. limit=12.0 2023-06-28 07:54:50,049 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.989e+02 7.621e+02 1.240e+03 1.706e+03 3.587e+03, threshold=2.479e+03, percent-clipped=22.0 2023-06-28 07:54:50,837 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 07:54:58,828 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2009616.0, ans=0.2 2023-06-28 07:55:04,727 INFO [train.py:996] (0/4) Epoch 11, batch 30000, loss[loss=0.1867, simple_loss=0.2816, pruned_loss=0.04588, over 21651.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2966, pruned_loss=0.06918, over 4269600.26 frames. ], batch size: 263, lr: 2.59e-03, grad_scale: 32.0 2023-06-28 07:55:04,728 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-28 07:55:14,101 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.5.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([2.3861, 2.2836, 4.3568, 4.0544], device='cuda:0') 2023-06-28 07:55:21,706 INFO [train.py:1028] (0/4) Epoch 11, validation: loss=0.2519, simple_loss=0.3444, pruned_loss=0.07975, over 1796401.00 frames. 2023-06-28 07:55:21,707 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23714MB 2023-06-28 07:56:53,820 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=2009916.0, ans=0.05 2023-06-28 07:57:10,930 INFO [train.py:996] (0/4) Epoch 11, batch 30050, loss[loss=0.2309, simple_loss=0.3662, pruned_loss=0.04781, over 20788.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.3002, pruned_loss=0.06646, over 4272010.64 frames. ], batch size: 607, lr: 2.59e-03, grad_scale: 32.0 2023-06-28 07:58:44,578 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.592e+02 6.904e+02 1.436e+03 2.249e+03 4.425e+03, threshold=2.873e+03, percent-clipped=20.0 2023-06-28 07:58:53,200 INFO [train.py:996] (0/4) Epoch 11, batch 30100, loss[loss=0.2035, simple_loss=0.2717, pruned_loss=0.06767, over 21781.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2985, pruned_loss=0.0655, over 4265888.47 frames. ], batch size: 351, lr: 2.59e-03, grad_scale: 16.0 2023-06-28 07:59:38,035 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.57 vs. limit=22.5 2023-06-28 08:00:16,539 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2010456.0, ans=0.125 2023-06-28 08:00:36,123 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.11 vs. 
limit=15.0 2023-06-28 08:00:36,680 INFO [train.py:996] (0/4) Epoch 11, batch 30150, loss[loss=0.1938, simple_loss=0.2644, pruned_loss=0.0616, over 21614.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2937, pruned_loss=0.06668, over 4266742.91 frames. ], batch size: 298, lr: 2.59e-03, grad_scale: 16.0 2023-06-28 08:00:37,527 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2010576.0, ans=0.1 2023-06-28 08:01:22,097 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=2010636.0, ans=0.2 2023-06-28 08:01:34,667 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=2010696.0, ans=0.0 2023-06-28 08:01:44,018 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.79 vs. limit=15.0 2023-06-28 08:01:48,312 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=2010756.0, ans=0.125 2023-06-28 08:02:02,498 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2010756.0, ans=0.1 2023-06-28 08:02:02,510 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=2010756.0, ans=0.2 2023-06-28 08:02:18,805 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.690e+02 6.702e+02 9.063e+02 1.523e+03 3.175e+03, threshold=1.813e+03, percent-clipped=2.0 2023-06-28 08:02:36,690 INFO [train.py:996] (0/4) Epoch 11, batch 30200, loss[loss=0.3163, simple_loss=0.3939, pruned_loss=0.1194, over 21408.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2973, pruned_loss=0.06686, over 4270515.74 frames. ], batch size: 507, lr: 2.59e-03, grad_scale: 16.0 2023-06-28 08:03:54,914 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2011116.0, ans=0.125 2023-06-28 08:04:03,349 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.52 vs. limit=15.0 2023-06-28 08:04:14,127 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2011116.0, ans=0.125 2023-06-28 08:04:15,810 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=2011116.0, ans=0.05 2023-06-28 08:04:21,936 INFO [train.py:996] (0/4) Epoch 11, batch 30250, loss[loss=0.2743, simple_loss=0.3895, pruned_loss=0.07956, over 21255.00 frames. ], tot_loss[loss=0.222, simple_loss=0.3052, pruned_loss=0.0694, over 4269760.37 frames. ], batch size: 549, lr: 2.59e-03, grad_scale: 16.0 2023-06-28 08:04:51,808 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2011236.0, ans=0.1 2023-06-28 08:05:55,953 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.393e+02 7.352e+02 1.154e+03 1.714e+03 3.720e+03, threshold=2.308e+03, percent-clipped=21.0 2023-06-28 08:06:04,342 INFO [train.py:996] (0/4) Epoch 11, batch 30300, loss[loss=0.202, simple_loss=0.2754, pruned_loss=0.06428, over 22017.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.3033, pruned_loss=0.0699, over 4270692.60 frames. 
], batch size: 103, lr: 2.59e-03, grad_scale: 16.0 2023-06-28 08:06:17,410 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=2011476.0, ans=0.125 2023-06-28 08:06:25,800 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=2011536.0, ans=0.2 2023-06-28 08:06:45,416 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2011596.0, ans=0.125 2023-06-28 08:06:48,654 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2011596.0, ans=0.09899494936611666 2023-06-28 08:06:52,466 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.16 vs. limit=15.0 2023-06-28 08:07:50,270 INFO [train.py:996] (0/4) Epoch 11, batch 30350, loss[loss=0.3293, simple_loss=0.4162, pruned_loss=0.1212, over 21456.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.3026, pruned_loss=0.07101, over 4269474.69 frames. ], batch size: 471, lr: 2.59e-03, grad_scale: 16.0 2023-06-28 08:07:57,748 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2011776.0, ans=0.125 2023-06-28 08:08:57,089 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.778e+02 8.957e+02 1.374e+03 2.295e+03 4.777e+03, threshold=2.749e+03, percent-clipped=24.0 2023-06-28 08:09:11,744 INFO [train.py:996] (0/4) Epoch 11, batch 30400, loss[loss=0.2125, simple_loss=0.2656, pruned_loss=0.07967, over 20263.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2977, pruned_loss=0.06991, over 4260592.29 frames. ], batch size: 703, lr: 2.59e-03, grad_scale: 32.0 2023-06-28 08:09:15,480 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2012076.0, ans=0.0 2023-06-28 08:09:50,400 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 08:10:34,696 INFO [train.py:996] (0/4) Epoch 11, batch 30450, loss[loss=0.232, simple_loss=0.3251, pruned_loss=0.06951, over 19977.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2975, pruned_loss=0.06918, over 4201611.04 frames. ], batch size: 702, lr: 2.59e-03, grad_scale: 8.0 2023-06-28 08:10:42,720 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2012376.0, ans=0.1 2023-06-28 08:10:58,699 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 08:11:06,443 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2012436.0, ans=0.125 2023-06-28 08:11:19,464 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=2012496.0, ans=15.0 2023-06-28 08:11:25,028 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.20 vs. 
limit=15.0 2023-06-28 08:11:38,902 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2012616.0, ans=0.125 2023-06-28 08:11:45,504 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/epoch-11.pt 2023-06-28 08:13:53,274 INFO [train.py:996] (0/4) Epoch 12, batch 0, loss[loss=0.1964, simple_loss=0.262, pruned_loss=0.06536, over 21538.00 frames. ], tot_loss[loss=0.1964, simple_loss=0.262, pruned_loss=0.06536, over 21538.00 frames. ], batch size: 263, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:13:53,276 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-28 08:14:03,502 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.5.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([2.3146, 2.0586, 3.7380, 3.5335], device='cuda:0') 2023-06-28 08:14:09,655 INFO [train.py:1028] (0/4) Epoch 12, validation: loss=0.2477, simple_loss=0.3485, pruned_loss=0.0734, over 1796401.00 frames. 2023-06-28 08:14:09,656 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23714MB 2023-06-28 08:14:12,903 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 6.161e+02 1.803e+03 3.374e+03 5.381e+03 1.358e+04, threshold=6.748e+03, percent-clipped=56.0 2023-06-28 08:14:36,070 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.10 vs. limit=22.5 2023-06-28 08:14:36,233 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=15.32 vs. limit=15.0 2023-06-28 08:15:22,314 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=2012826.0, ans=0.0 2023-06-28 08:15:25,586 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2012826.0, ans=0.0 2023-06-28 08:15:53,140 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2012946.0, ans=0.125 2023-06-28 08:15:54,174 INFO [train.py:996] (0/4) Epoch 12, batch 50, loss[loss=0.2314, simple_loss=0.3085, pruned_loss=0.07712, over 21406.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.3044, pruned_loss=0.07017, over 961914.56 frames. ], batch size: 176, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:16:45,473 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.62 vs. 
limit=15.0 2023-06-28 08:16:58,554 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=2013066.0, ans=0.2 2023-06-28 08:17:03,366 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=2013126.0, ans=0.125 2023-06-28 08:17:08,336 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=2013126.0, ans=0.2 2023-06-28 08:17:13,294 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2013126.0, ans=0.125 2023-06-28 08:17:17,914 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2013126.0, ans=0.125 2023-06-28 08:17:37,276 INFO [train.py:996] (0/4) Epoch 12, batch 100, loss[loss=0.2787, simple_loss=0.366, pruned_loss=0.09567, over 21547.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3215, pruned_loss=0.07351, over 1691813.59 frames. ], batch size: 471, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:17:40,464 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.118e+02 6.672e+02 9.899e+02 1.706e+03 3.699e+03, threshold=1.980e+03, percent-clipped=0.0 2023-06-28 08:18:40,279 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=2013366.0, ans=0.125 2023-06-28 08:18:56,086 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=2013426.0, ans=0.125 2023-06-28 08:19:08,560 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=2013486.0, ans=22.5 2023-06-28 08:19:18,618 INFO [train.py:996] (0/4) Epoch 12, batch 150, loss[loss=0.2313, simple_loss=0.3314, pruned_loss=0.06559, over 21399.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3215, pruned_loss=0.07273, over 2257920.52 frames. ], batch size: 211, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:19:48,094 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.84 vs. limit=22.5 2023-06-28 08:20:34,022 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.00 vs. limit=6.0 2023-06-28 08:20:36,925 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2013726.0, ans=0.125 2023-06-28 08:20:38,550 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=2013726.0, ans=10.0 2023-06-28 08:20:57,644 INFO [train.py:996] (0/4) Epoch 12, batch 200, loss[loss=0.1894, simple_loss=0.2728, pruned_loss=0.05297, over 21404.00 frames. ], tot_loss[loss=0.23, simple_loss=0.317, pruned_loss=0.07154, over 2704725.98 frames. ], batch size: 194, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:21:00,953 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.147e+02 7.904e+02 1.199e+03 1.656e+03 3.803e+03, threshold=2.398e+03, percent-clipped=21.0 2023-06-28 08:22:04,816 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.88 vs. 
limit=15.0 2023-06-28 08:22:12,675 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=2014026.0, ans=10.0 2023-06-28 08:22:41,054 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=2014146.0, ans=0.0 2023-06-28 08:22:42,025 INFO [train.py:996] (0/4) Epoch 12, batch 250, loss[loss=0.1949, simple_loss=0.2737, pruned_loss=0.05803, over 21883.00 frames. ], tot_loss[loss=0.226, simple_loss=0.3117, pruned_loss=0.0701, over 3061485.31 frames. ], batch size: 124, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:23:07,339 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=2014206.0, ans=0.125 2023-06-28 08:23:54,814 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.56 vs. limit=15.0 2023-06-28 08:24:03,077 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2014326.0, ans=0.125 2023-06-28 08:24:05,609 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.12 vs. limit=22.5 2023-06-28 08:24:32,100 INFO [train.py:996] (0/4) Epoch 12, batch 300, loss[loss=0.1651, simple_loss=0.2316, pruned_loss=0.04926, over 21602.00 frames. ], tot_loss[loss=0.224, simple_loss=0.3076, pruned_loss=0.07016, over 3331945.91 frames. ], batch size: 231, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:24:35,392 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.745e+02 6.973e+02 9.077e+02 1.413e+03 3.093e+03, threshold=1.815e+03, percent-clipped=6.0 2023-06-28 08:26:20,885 INFO [train.py:996] (0/4) Epoch 12, batch 350, loss[loss=0.2081, simple_loss=0.2744, pruned_loss=0.07092, over 20091.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2988, pruned_loss=0.06801, over 3534625.33 frames. ], batch size: 704, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:27:35,616 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=2014926.0, ans=0.0 2023-06-28 08:27:37,229 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=2014926.0, ans=0.125 2023-06-28 08:28:02,854 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2014986.0, ans=0.125 2023-06-28 08:28:07,204 INFO [train.py:996] (0/4) Epoch 12, batch 400, loss[loss=0.1746, simple_loss=0.2504, pruned_loss=0.04943, over 21308.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2919, pruned_loss=0.0661, over 3697506.38 frames. ], batch size: 131, lr: 2.47e-03, grad_scale: 32.0 2023-06-28 08:28:10,668 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.793e+02 7.672e+02 1.106e+03 1.472e+03 3.614e+03, threshold=2.212e+03, percent-clipped=11.0 2023-06-28 08:28:15,562 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.77 vs. 
limit=22.5 2023-06-28 08:28:34,890 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2015106.0, ans=0.1 2023-06-28 08:28:54,368 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.45 vs. limit=15.0 2023-06-28 08:29:45,555 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=2015286.0, ans=0.5 2023-06-28 08:29:53,467 INFO [train.py:996] (0/4) Epoch 12, batch 450, loss[loss=0.2586, simple_loss=0.3643, pruned_loss=0.07647, over 21686.00 frames. ], tot_loss[loss=0.2085, simple_loss=0.2872, pruned_loss=0.0649, over 3832609.88 frames. ], batch size: 414, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:30:03,654 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.49 vs. limit=22.5 2023-06-28 08:30:19,745 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=2015406.0, ans=0.2 2023-06-28 08:30:38,018 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2015466.0, ans=0.1 2023-06-28 08:31:11,218 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=2015526.0, ans=0.125 2023-06-28 08:31:11,804 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.65 vs. limit=22.5 2023-06-28 08:31:37,496 INFO [train.py:996] (0/4) Epoch 12, batch 500, loss[loss=0.2246, simple_loss=0.3301, pruned_loss=0.0595, over 21742.00 frames. ], tot_loss[loss=0.212, simple_loss=0.2948, pruned_loss=0.06467, over 3934537.76 frames. ], batch size: 332, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:31:42,491 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.576e+02 9.650e+02 1.378e+03 2.425e+03 6.087e+03, threshold=2.755e+03, percent-clipped=29.0 2023-06-28 08:32:39,141 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.38 vs. limit=15.0 2023-06-28 08:32:45,269 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2015766.0, ans=0.125 2023-06-28 08:33:16,023 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2015886.0, ans=0.125 2023-06-28 08:33:22,145 INFO [train.py:996] (0/4) Epoch 12, batch 550, loss[loss=0.2181, simple_loss=0.3345, pruned_loss=0.05081, over 19898.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2996, pruned_loss=0.06443, over 4016666.95 frames. ], batch size: 703, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:33:45,068 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-336000.pt 2023-06-28 08:33:47,917 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.84 vs. limit=22.5 2023-06-28 08:33:54,918 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.40 vs. 
limit=15.0 2023-06-28 08:35:00,648 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.57 vs. limit=15.0 2023-06-28 08:35:00,979 INFO [train.py:996] (0/4) Epoch 12, batch 600, loss[loss=0.2423, simple_loss=0.361, pruned_loss=0.06185, over 21239.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.303, pruned_loss=0.06515, over 4072396.16 frames. ], batch size: 548, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:35:05,857 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.203e+02 8.466e+02 1.434e+03 2.194e+03 5.258e+03, threshold=2.867e+03, percent-clipped=12.0 2023-06-28 08:36:07,090 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2016366.0, ans=0.1 2023-06-28 08:36:44,630 INFO [train.py:996] (0/4) Epoch 12, batch 650, loss[loss=0.2088, simple_loss=0.2776, pruned_loss=0.06999, over 19916.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.3001, pruned_loss=0.06544, over 4116980.45 frames. ], batch size: 704, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:37:22,584 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2016606.0, ans=0.125 2023-06-28 08:37:31,876 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2016666.0, ans=0.0 2023-06-28 08:37:48,268 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=2016726.0, ans=0.125 2023-06-28 08:37:48,335 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2016726.0, ans=0.125 2023-06-28 08:37:49,909 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2016726.0, ans=0.125 2023-06-28 08:38:00,031 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=2016726.0, ans=0.05 2023-06-28 08:38:03,358 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2016726.0, ans=0.125 2023-06-28 08:38:12,408 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.81 vs. limit=15.0 2023-06-28 08:38:17,440 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.95 vs. limit=10.0 2023-06-28 08:38:23,205 INFO [train.py:996] (0/4) Epoch 12, batch 700, loss[loss=0.2298, simple_loss=0.3116, pruned_loss=0.07401, over 21773.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2974, pruned_loss=0.06612, over 4148512.73 frames. ], batch size: 112, lr: 2.47e-03, grad_scale: 8.0 2023-06-28 08:38:34,700 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.083e+02 8.629e+02 1.370e+03 1.985e+03 4.368e+03, threshold=2.739e+03, percent-clipped=8.0 2023-06-28 08:39:06,460 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.33 vs. 
limit=15.0 2023-06-28 08:39:48,513 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=2017086.0, ans=0.125 2023-06-28 08:40:03,784 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2017086.0, ans=0.125 2023-06-28 08:40:06,459 INFO [train.py:996] (0/4) Epoch 12, batch 750, loss[loss=0.2027, simple_loss=0.2648, pruned_loss=0.0703, over 21729.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2964, pruned_loss=0.06746, over 4179381.55 frames. ], batch size: 298, lr: 2.47e-03, grad_scale: 8.0 2023-06-28 08:40:30,315 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2017146.0, ans=0.125 2023-06-28 08:41:13,514 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 08:41:39,559 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.48 vs. limit=15.0 2023-06-28 08:41:41,336 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.82 vs. limit=15.0 2023-06-28 08:41:50,326 INFO [train.py:996] (0/4) Epoch 12, batch 800, loss[loss=0.1956, simple_loss=0.2659, pruned_loss=0.06264, over 21871.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2916, pruned_loss=0.06719, over 4209128.22 frames. ], batch size: 283, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:42:01,928 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.954e+02 9.081e+02 1.260e+03 2.091e+03 4.459e+03, threshold=2.521e+03, percent-clipped=14.0 2023-06-28 08:42:18,915 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2017506.0, ans=0.1 2023-06-28 08:42:25,689 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=2017506.0, ans=0.04949747468305833 2023-06-28 08:42:27,115 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=2017506.0, ans=0.025 2023-06-28 08:42:54,282 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.42 vs. limit=12.0 2023-06-28 08:43:14,129 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.25 vs. limit=6.0 2023-06-28 08:43:30,371 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=2017686.0, ans=0.07 2023-06-28 08:43:33,339 INFO [train.py:996] (0/4) Epoch 12, batch 850, loss[loss=0.1965, simple_loss=0.2816, pruned_loss=0.05572, over 21651.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2897, pruned_loss=0.06649, over 4232255.92 frames. ], batch size: 247, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:43:36,123 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.91 vs. 
limit=15.0 2023-06-28 08:44:47,573 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2017926.0, ans=0.125 2023-06-28 08:44:57,639 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=2017926.0, ans=0.125 2023-06-28 08:45:11,281 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2017986.0, ans=0.0 2023-06-28 08:45:21,872 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2017986.0, ans=0.125 2023-06-28 08:45:24,330 INFO [train.py:996] (0/4) Epoch 12, batch 900, loss[loss=0.1973, simple_loss=0.2927, pruned_loss=0.05091, over 21829.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2875, pruned_loss=0.06665, over 4248890.03 frames. ], batch size: 372, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:45:35,784 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.139e+02 7.978e+02 1.292e+03 1.942e+03 4.093e+03, threshold=2.584e+03, percent-clipped=13.0 2023-06-28 08:45:40,006 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2018046.0, ans=0.125 2023-06-28 08:46:40,490 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2018226.0, ans=0.125 2023-06-28 08:47:06,542 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=2018286.0, ans=0.2 2023-06-28 08:47:09,765 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2018286.0, ans=0.1 2023-06-28 08:47:13,361 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=2018346.0, ans=0.0 2023-06-28 08:47:14,378 INFO [train.py:996] (0/4) Epoch 12, batch 950, loss[loss=0.2212, simple_loss=0.3025, pruned_loss=0.06998, over 21864.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2862, pruned_loss=0.06582, over 4258397.85 frames. ], batch size: 107, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:47:14,907 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=2018346.0, ans=0.05 2023-06-28 08:47:27,236 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.46 vs. limit=12.0 2023-06-28 08:47:51,050 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=2018406.0, ans=0.0 2023-06-28 08:47:56,289 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=2018466.0, ans=0.0 2023-06-28 08:47:59,660 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=2018466.0, ans=0.04949747468305833 2023-06-28 08:48:29,712 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=2018586.0, ans=0.2 2023-06-28 08:48:56,776 INFO [train.py:996] (0/4) Epoch 12, batch 1000, loss[loss=0.242, simple_loss=0.3247, pruned_loss=0.07964, over 21376.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2864, pruned_loss=0.0649, over 4262917.06 frames. 
], batch size: 131, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:49:03,706 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.940e+02 7.062e+02 8.970e+02 1.402e+03 3.868e+03, threshold=1.794e+03, percent-clipped=7.0 2023-06-28 08:49:11,189 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2018646.0, ans=0.125 2023-06-28 08:49:50,243 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 08:49:53,302 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=2018766.0, ans=0.015 2023-06-28 08:50:07,090 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2018826.0, ans=0.125 2023-06-28 08:50:42,110 INFO [train.py:996] (0/4) Epoch 12, batch 1050, loss[loss=0.192, simple_loss=0.2709, pruned_loss=0.05649, over 21327.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2865, pruned_loss=0.06556, over 4276274.58 frames. ], batch size: 176, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:51:04,689 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2019006.0, ans=0.1 2023-06-28 08:51:07,883 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2019006.0, ans=0.0 2023-06-28 08:52:20,405 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2019186.0, ans=0.125 2023-06-28 08:52:31,733 INFO [train.py:996] (0/4) Epoch 12, batch 1100, loss[loss=0.199, simple_loss=0.3026, pruned_loss=0.04768, over 21700.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.289, pruned_loss=0.06532, over 4275669.81 frames. ], batch size: 247, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:52:39,017 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.826e+02 7.483e+02 1.102e+03 1.696e+03 3.574e+03, threshold=2.203e+03, percent-clipped=22.0 2023-06-28 08:54:17,194 INFO [train.py:996] (0/4) Epoch 12, batch 1150, loss[loss=0.2215, simple_loss=0.3101, pruned_loss=0.06642, over 21747.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2914, pruned_loss=0.06538, over 4279191.29 frames. ], batch size: 414, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:54:37,872 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=2019606.0, ans=0.2 2023-06-28 08:54:46,550 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=2019606.0, ans=0.2 2023-06-28 08:54:48,270 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2019606.0, ans=0.0 2023-06-28 08:54:50,446 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.64 vs. limit=15.0 2023-06-28 08:56:08,750 INFO [train.py:996] (0/4) Epoch 12, batch 1200, loss[loss=0.1907, simple_loss=0.2568, pruned_loss=0.06228, over 21266.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.2918, pruned_loss=0.06589, over 4283418.96 frames. 
], batch size: 608, lr: 2.47e-03, grad_scale: 32.0 2023-06-28 08:56:15,508 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.003e+02 8.397e+02 1.494e+03 2.117e+03 4.524e+03, threshold=2.987e+03, percent-clipped=23.0 2023-06-28 08:56:53,085 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=2019966.0, ans=10.0 2023-06-28 08:57:32,856 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff2.min_abs, batch_count=2020086.0, ans=0.1 2023-06-28 08:57:49,456 INFO [train.py:996] (0/4) Epoch 12, batch 1250, loss[loss=0.2244, simple_loss=0.3076, pruned_loss=0.07063, over 16654.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2915, pruned_loss=0.06644, over 4273309.16 frames. ], batch size: 61, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:58:14,483 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2020206.0, ans=0.0 2023-06-28 08:58:19,359 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=2020206.0, ans=0.125 2023-06-28 08:58:59,215 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.14 vs. limit=6.0 2023-06-28 08:59:32,773 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2020386.0, ans=0.0 2023-06-28 08:59:40,401 INFO [train.py:996] (0/4) Epoch 12, batch 1300, loss[loss=0.2178, simple_loss=0.305, pruned_loss=0.06526, over 21756.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2926, pruned_loss=0.06728, over 4280242.69 frames. ], batch size: 351, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:59:46,531 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.33 vs. limit=15.0 2023-06-28 08:59:48,735 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.742e+02 7.744e+02 1.078e+03 1.630e+03 3.241e+03, threshold=2.156e+03, percent-clipped=1.0 2023-06-28 08:59:49,336 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=2020446.0, ans=0.125 2023-06-28 09:00:14,664 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2020566.0, ans=0.125 2023-06-28 09:00:45,347 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2020626.0, ans=0.1 2023-06-28 09:01:02,284 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2020686.0, ans=0.1 2023-06-28 09:01:25,445 INFO [train.py:996] (0/4) Epoch 12, batch 1350, loss[loss=0.2394, simple_loss=0.3153, pruned_loss=0.08179, over 21690.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.294, pruned_loss=0.06743, over 4284094.71 frames. ], batch size: 351, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 09:01:34,673 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=2020746.0, ans=0.0 2023-06-28 09:02:14,114 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.82 vs. 
limit=10.0 2023-06-28 09:02:23,439 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2020926.0, ans=0.0 2023-06-28 09:02:28,330 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=2020926.0, ans=0.05 2023-06-28 09:02:41,009 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.02 vs. limit=15.0 2023-06-28 09:02:42,241 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2020986.0, ans=0.125 2023-06-28 09:03:05,006 INFO [train.py:996] (0/4) Epoch 12, batch 1400, loss[loss=0.181, simple_loss=0.2446, pruned_loss=0.05874, over 21559.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2922, pruned_loss=0.06746, over 4281439.25 frames. ], batch size: 196, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 09:03:09,580 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.48 vs. limit=12.0 2023-06-28 09:03:13,309 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.716e+02 8.874e+02 1.255e+03 1.971e+03 3.857e+03, threshold=2.510e+03, percent-clipped=18.0 2023-06-28 09:03:21,534 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.73 vs. limit=15.0 2023-06-28 09:03:24,818 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.92 vs. limit=15.0 2023-06-28 09:04:05,614 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.26 vs. limit=12.0 2023-06-28 09:04:28,913 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=2021286.0, ans=0.2 2023-06-28 09:04:46,596 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.68 vs. limit=15.0 2023-06-28 09:04:50,281 INFO [train.py:996] (0/4) Epoch 12, batch 1450, loss[loss=0.2293, simple_loss=0.3129, pruned_loss=0.07286, over 21433.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.291, pruned_loss=0.06693, over 4281965.62 frames. 
], batch size: 131, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 09:04:57,811 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2021346.0, ans=0.0 2023-06-28 09:05:14,572 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2021406.0, ans=0.125 2023-06-28 09:05:14,630 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=2021406.0, ans=0.0 2023-06-28 09:06:08,754 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2021526.0, ans=0.125 2023-06-28 09:06:27,628 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=2021586.0, ans=0.125 2023-06-28 09:06:34,574 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2021586.0, ans=0.0 2023-06-28 09:06:37,301 INFO [train.py:996] (0/4) Epoch 12, batch 1500, loss[loss=0.2141, simple_loss=0.294, pruned_loss=0.06709, over 21888.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2914, pruned_loss=0.06823, over 4286330.36 frames. ], batch size: 118, lr: 2.47e-03, grad_scale: 8.0 2023-06-28 09:06:47,789 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.815e+02 8.356e+02 1.274e+03 1.855e+03 4.343e+03, threshold=2.548e+03, percent-clipped=12.0 2023-06-28 09:06:48,587 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2021646.0, ans=0.125 2023-06-28 09:06:56,977 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=2021706.0, ans=0.2 2023-06-28 09:07:54,244 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.13 vs. limit=10.0 2023-06-28 09:08:04,254 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2021826.0, ans=0.125 2023-06-28 09:08:12,838 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2021886.0, ans=0.125 2023-06-28 09:08:12,904 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2021886.0, ans=0.125 2023-06-28 09:08:21,572 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=2021886.0, ans=0.125 2023-06-28 09:08:24,459 INFO [train.py:996] (0/4) Epoch 12, batch 1550, loss[loss=0.2055, simple_loss=0.2788, pruned_loss=0.06609, over 21492.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2911, pruned_loss=0.06811, over 4277471.07 frames. ], batch size: 211, lr: 2.47e-03, grad_scale: 8.0 2023-06-28 09:08:25,165 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2021946.0, ans=0.0 2023-06-28 09:08:35,461 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2021946.0, ans=0.0 2023-06-28 09:08:55,338 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.89 vs. 
limit=22.5 2023-06-28 09:09:16,479 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2022066.0, ans=0.125 2023-06-28 09:09:26,485 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2022066.0, ans=0.125 2023-06-28 09:09:55,272 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=2022186.0, ans=0.95 2023-06-28 09:10:00,490 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2022186.0, ans=0.1 2023-06-28 09:10:03,924 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2022186.0, ans=0.125 2023-06-28 09:10:05,582 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=2022186.0, ans=0.2 2023-06-28 09:10:08,935 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=2022246.0, ans=0.025 2023-06-28 09:10:09,936 INFO [train.py:996] (0/4) Epoch 12, batch 1600, loss[loss=0.1795, simple_loss=0.2484, pruned_loss=0.05528, over 21204.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2904, pruned_loss=0.06794, over 4284036.91 frames. ], batch size: 159, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 09:10:10,470 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2022246.0, ans=0.125 2023-06-28 09:10:20,072 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.882e+02 7.910e+02 1.210e+03 1.920e+03 3.790e+03, threshold=2.419e+03, percent-clipped=9.0 2023-06-28 09:11:17,368 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.63 vs. limit=15.0 2023-06-28 09:11:45,136 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2022486.0, ans=0.1 2023-06-28 09:11:58,012 INFO [train.py:996] (0/4) Epoch 12, batch 1650, loss[loss=0.1761, simple_loss=0.2724, pruned_loss=0.03987, over 21771.00 frames. ], tot_loss[loss=0.212, simple_loss=0.2893, pruned_loss=0.06731, over 4282438.35 frames. ], batch size: 351, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 09:12:02,643 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.38 vs. limit=22.5 2023-06-28 09:12:42,050 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2022666.0, ans=0.1 2023-06-28 09:13:37,337 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2022786.0, ans=0.1 2023-06-28 09:13:45,608 INFO [train.py:996] (0/4) Epoch 12, batch 1700, loss[loss=0.2308, simple_loss=0.3091, pruned_loss=0.07625, over 21162.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2917, pruned_loss=0.06813, over 4279918.16 frames. 
], batch size: 143, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 09:13:55,786 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.138e+02 6.859e+02 1.024e+03 1.407e+03 3.205e+03, threshold=2.048e+03, percent-clipped=5.0 2023-06-28 09:15:16,290 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.87 vs. limit=22.5 2023-06-28 09:15:32,584 INFO [train.py:996] (0/4) Epoch 12, batch 1750, loss[loss=0.1199, simple_loss=0.1763, pruned_loss=0.03176, over 17033.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2915, pruned_loss=0.06724, over 4277930.07 frames. ], batch size: 60, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 09:15:44,126 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2023146.0, ans=0.125 2023-06-28 09:17:25,801 INFO [train.py:996] (0/4) Epoch 12, batch 1800, loss[loss=0.1991, simple_loss=0.3, pruned_loss=0.04907, over 21628.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2895, pruned_loss=0.06444, over 4277598.56 frames. ], batch size: 230, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 09:17:46,588 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.284e+02 7.829e+02 1.190e+03 1.910e+03 4.483e+03, threshold=2.381e+03, percent-clipped=19.0 2023-06-28 09:18:01,586 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.84 vs. limit=15.0 2023-06-28 09:18:13,013 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=2023566.0, ans=0.0 2023-06-28 09:19:05,371 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2023686.0, ans=0.125 2023-06-28 09:19:11,629 INFO [train.py:996] (0/4) Epoch 12, batch 1850, loss[loss=0.22, simple_loss=0.2983, pruned_loss=0.07086, over 21557.00 frames. ], tot_loss[loss=0.2066, simple_loss=0.289, pruned_loss=0.06208, over 4279268.32 frames. ], batch size: 441, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 09:19:53,223 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2023806.0, ans=0.125 2023-06-28 09:19:58,456 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2023866.0, ans=0.0 2023-06-28 09:20:03,935 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=2023866.0, ans=0.05 2023-06-28 09:20:07,547 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.11 vs. limit=10.0 2023-06-28 09:20:12,133 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=2023926.0, ans=0.125 2023-06-28 09:20:27,441 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=2023926.0, ans=0.0 2023-06-28 09:20:59,508 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.67 vs. limit=15.0 2023-06-28 09:20:59,929 INFO [train.py:996] (0/4) Epoch 12, batch 1900, loss[loss=0.1905, simple_loss=0.2754, pruned_loss=0.05285, over 21764.00 frames. 
], tot_loss[loss=0.209, simple_loss=0.2918, pruned_loss=0.06307, over 4277621.49 frames. ], batch size: 247, lr: 2.47e-03, grad_scale: 8.0 2023-06-28 09:21:22,276 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.886e+02 8.389e+02 1.357e+03 2.180e+03 3.591e+03, threshold=2.714e+03, percent-clipped=20.0 2023-06-28 09:21:34,974 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2024106.0, ans=0.125 2023-06-28 09:22:46,553 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=2024286.0, ans=0.0 2023-06-28 09:22:54,023 INFO [train.py:996] (0/4) Epoch 12, batch 1950, loss[loss=0.2097, simple_loss=0.287, pruned_loss=0.06627, over 21546.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2883, pruned_loss=0.06308, over 4269874.63 frames. ], batch size: 212, lr: 2.47e-03, grad_scale: 8.0 2023-06-28 09:23:18,787 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2024406.0, ans=0.125 2023-06-28 09:24:40,572 INFO [train.py:996] (0/4) Epoch 12, batch 2000, loss[loss=0.1873, simple_loss=0.2548, pruned_loss=0.05988, over 21714.00 frames. ], tot_loss[loss=0.2029, simple_loss=0.2824, pruned_loss=0.06174, over 4275491.65 frames. ], batch size: 299, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 09:24:52,581 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.543e+02 8.090e+02 1.262e+03 2.210e+03 4.405e+03, threshold=2.524e+03, percent-clipped=15.0 2023-06-28 09:24:53,954 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.81 vs. limit=15.0 2023-06-28 09:25:40,663 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2024826.0, ans=0.125 2023-06-28 09:26:12,974 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.12 vs. limit=15.0 2023-06-28 09:26:25,013 INFO [train.py:996] (0/4) Epoch 12, batch 2050, loss[loss=0.216, simple_loss=0.2968, pruned_loss=0.06762, over 21882.00 frames. ], tot_loss[loss=0.2022, simple_loss=0.2821, pruned_loss=0.06121, over 4272862.21 frames. ], batch size: 118, lr: 2.47e-03, grad_scale: 8.0 2023-06-28 09:26:36,367 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.45 vs. 
limit=8.0 2023-06-28 09:26:37,219 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2024946.0, ans=0.1 2023-06-28 09:26:45,424 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=2025006.0, ans=0.0 2023-06-28 09:26:55,330 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=2025006.0, ans=0.0 2023-06-28 09:27:08,403 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2025066.0, ans=0.1 2023-06-28 09:27:11,432 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2025066.0, ans=0.125 2023-06-28 09:27:42,113 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2025186.0, ans=0.0 2023-06-28 09:27:53,490 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=2025186.0, ans=0.2 2023-06-28 09:28:07,573 INFO [train.py:996] (0/4) Epoch 12, batch 2100, loss[loss=0.2277, simple_loss=0.31, pruned_loss=0.07269, over 21727.00 frames. ], tot_loss[loss=0.2058, simple_loss=0.2859, pruned_loss=0.06282, over 4278213.35 frames. ], batch size: 247, lr: 2.47e-03, grad_scale: 8.0 2023-06-28 09:28:21,405 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.339e+02 9.933e+02 1.500e+03 2.145e+03 4.437e+03, threshold=3.000e+03, percent-clipped=17.0 2023-06-28 09:28:39,001 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff3.min_abs, batch_count=2025306.0, ans=0.2 2023-06-28 09:29:20,147 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.65 vs. limit=5.0 2023-06-28 09:29:46,446 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2025486.0, ans=0.1 2023-06-28 09:29:52,491 INFO [train.py:996] (0/4) Epoch 12, batch 2150, loss[loss=0.2017, simple_loss=0.2746, pruned_loss=0.06446, over 21321.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2864, pruned_loss=0.06401, over 4275389.24 frames. ], batch size: 131, lr: 2.47e-03, grad_scale: 8.0 2023-06-28 09:29:57,929 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=2025546.0, ans=0.0 2023-06-28 09:29:59,938 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2025546.0, ans=0.1 2023-06-28 09:31:13,514 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=2025786.0, ans=0.5 2023-06-28 09:31:37,748 INFO [train.py:996] (0/4) Epoch 12, batch 2200, loss[loss=0.2301, simple_loss=0.3018, pruned_loss=0.07921, over 21792.00 frames. ], tot_loss[loss=0.2085, simple_loss=0.2868, pruned_loss=0.0651, over 4277667.61 frames. 
], batch size: 441, lr: 2.46e-03, grad_scale: 8.0 2023-06-28 09:31:51,398 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.594e+02 7.144e+02 1.049e+03 1.524e+03 3.402e+03, threshold=2.098e+03, percent-clipped=4.0 2023-06-28 09:31:57,637 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.98 vs. limit=6.0 2023-06-28 09:32:08,765 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=2025906.0, ans=0.0 2023-06-28 09:32:52,342 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=2026026.0, ans=0.2 2023-06-28 09:33:16,623 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.70 vs. limit=15.0 2023-06-28 09:33:21,818 INFO [train.py:996] (0/4) Epoch 12, batch 2250, loss[loss=0.1724, simple_loss=0.2411, pruned_loss=0.05184, over 21756.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.2857, pruned_loss=0.06388, over 4283535.54 frames. ], batch size: 118, lr: 2.46e-03, grad_scale: 8.0 2023-06-28 09:33:30,637 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2026146.0, ans=0.2 2023-06-28 09:33:35,894 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=2026146.0, ans=0.125 2023-06-28 09:33:44,099 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2026206.0, ans=0.0 2023-06-28 09:34:15,970 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 09:34:45,272 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=2026386.0, ans=0.2 2023-06-28 09:35:06,611 INFO [train.py:996] (0/4) Epoch 12, batch 2300, loss[loss=0.1933, simple_loss=0.2574, pruned_loss=0.06467, over 21615.00 frames. ], tot_loss[loss=0.2039, simple_loss=0.2809, pruned_loss=0.06342, over 4274712.43 frames. ], batch size: 298, lr: 2.46e-03, grad_scale: 8.0 2023-06-28 09:35:20,310 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.470e+02 7.194e+02 1.165e+03 1.936e+03 3.464e+03, threshold=2.331e+03, percent-clipped=21.0 2023-06-28 09:35:26,420 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=2026506.0, ans=0.0 2023-06-28 09:35:55,010 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=2026566.0, ans=0.2 2023-06-28 09:35:55,021 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=2026566.0, ans=0.0 2023-06-28 09:36:09,401 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.32 vs. limit=10.0 2023-06-28 09:36:45,283 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2026686.0, ans=0.2 2023-06-28 09:36:53,097 INFO [train.py:996] (0/4) Epoch 12, batch 2350, loss[loss=0.2277, simple_loss=0.2986, pruned_loss=0.07837, over 21827.00 frames. ], tot_loss[loss=0.205, simple_loss=0.2806, pruned_loss=0.0647, over 4276729.81 frames. 
], batch size: 441, lr: 2.46e-03, grad_scale: 8.0 2023-06-28 09:37:36,645 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.98 vs. limit=12.0 2023-06-28 09:37:49,785 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=2026866.0, ans=0.125 2023-06-28 09:38:10,765 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=2026926.0, ans=0.0 2023-06-28 09:38:38,340 INFO [train.py:996] (0/4) Epoch 12, batch 2400, loss[loss=0.24, simple_loss=0.3408, pruned_loss=0.06962, over 17337.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2865, pruned_loss=0.06645, over 4274664.99 frames. ], batch size: 60, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 09:38:57,281 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.929e+02 7.615e+02 1.092e+03 1.757e+03 3.744e+03, threshold=2.185e+03, percent-clipped=12.0 2023-06-28 09:39:15,288 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=2027106.0, ans=0.2 2023-06-28 09:40:14,237 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=2027286.0, ans=0.0 2023-06-28 09:40:24,048 INFO [train.py:996] (0/4) Epoch 12, batch 2450, loss[loss=0.2067, simple_loss=0.2757, pruned_loss=0.06886, over 21880.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2912, pruned_loss=0.06805, over 4278280.92 frames. ], batch size: 98, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 09:40:35,511 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.58 vs. limit=6.0 2023-06-28 09:41:17,402 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2027466.0, ans=0.125 2023-06-28 09:41:44,339 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2027526.0, ans=0.125 2023-06-28 09:42:04,553 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2027586.0, ans=0.1 2023-06-28 09:42:08,974 INFO [train.py:996] (0/4) Epoch 12, batch 2500, loss[loss=0.2053, simple_loss=0.3218, pruned_loss=0.04437, over 20887.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2902, pruned_loss=0.06635, over 4276421.59 frames. ], batch size: 609, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 09:42:27,033 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.946e+02 7.873e+02 1.330e+03 1.943e+03 4.895e+03, threshold=2.659e+03, percent-clipped=18.0 2023-06-28 09:42:27,686 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=2027646.0, ans=10.0 2023-06-28 09:43:10,005 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=2027766.0, ans=0.2 2023-06-28 09:43:53,466 INFO [train.py:996] (0/4) Epoch 12, batch 2550, loss[loss=0.214, simple_loss=0.2964, pruned_loss=0.06578, over 21804.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.287, pruned_loss=0.0651, over 4275427.02 frames. 
], batch size: 124, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 09:44:09,239 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2027946.0, ans=0.125 2023-06-28 09:44:09,777 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.55 vs. limit=10.0 2023-06-28 09:44:12,454 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2027946.0, ans=0.125 2023-06-28 09:45:32,782 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2028186.0, ans=0.125 2023-06-28 09:45:37,020 INFO [train.py:996] (0/4) Epoch 12, batch 2600, loss[loss=0.2283, simple_loss=0.305, pruned_loss=0.07581, over 21734.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2874, pruned_loss=0.06609, over 4265982.03 frames. ], batch size: 332, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 09:45:55,672 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.364e+02 1.004e+03 1.411e+03 2.308e+03 3.873e+03, threshold=2.822e+03, percent-clipped=11.0 2023-06-28 09:46:07,000 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=14.20 vs. limit=15.0 2023-06-28 09:46:24,892 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2028366.0, ans=0.0 2023-06-28 09:46:48,800 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=2028426.0, ans=0.125 2023-06-28 09:47:17,624 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2028486.0, ans=0.125 2023-06-28 09:47:21,870 INFO [train.py:996] (0/4) Epoch 12, batch 2650, loss[loss=0.2111, simple_loss=0.2845, pruned_loss=0.06884, over 21723.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2903, pruned_loss=0.06776, over 4268912.69 frames. ], batch size: 389, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 09:48:16,281 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=2028666.0, ans=0.125 2023-06-28 09:48:17,782 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2028666.0, ans=0.125 2023-06-28 09:48:24,444 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=2028666.0, ans=0.125 2023-06-28 09:48:31,295 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2028726.0, ans=0.1 2023-06-28 09:49:06,751 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=2028846.0, ans=0.125 2023-06-28 09:49:07,743 INFO [train.py:996] (0/4) Epoch 12, batch 2700, loss[loss=0.1743, simple_loss=0.2264, pruned_loss=0.06107, over 20734.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.287, pruned_loss=0.06681, over 4260429.08 frames. ], batch size: 607, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 09:49:10,623 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.65 vs. 
limit=15.0 2023-06-28 09:49:12,473 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.65 vs. limit=15.0 2023-06-28 09:49:25,946 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.017e+02 6.917e+02 8.915e+02 1.240e+03 3.062e+03, threshold=1.783e+03, percent-clipped=1.0 2023-06-28 09:50:51,160 INFO [train.py:996] (0/4) Epoch 12, batch 2750, loss[loss=0.2213, simple_loss=0.282, pruned_loss=0.08028, over 21539.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2861, pruned_loss=0.06701, over 4261210.87 frames. ], batch size: 548, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 09:51:03,800 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=2029146.0, ans=0.125 2023-06-28 09:51:52,104 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 09:52:22,053 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.08 vs. limit=22.5 2023-06-28 09:52:43,516 INFO [train.py:996] (0/4) Epoch 12, batch 2800, loss[loss=0.2437, simple_loss=0.3123, pruned_loss=0.08758, over 21238.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2924, pruned_loss=0.06839, over 4268061.53 frames. ], batch size: 176, lr: 2.46e-03, grad_scale: 32.0 2023-06-28 09:52:58,754 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.305e+02 8.151e+02 1.437e+03 2.226e+03 4.806e+03, threshold=2.874e+03, percent-clipped=38.0 2023-06-28 09:52:59,926 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.98 vs. limit=15.0 2023-06-28 09:53:37,520 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.34 vs. limit=15.0 2023-06-28 09:53:42,044 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=2029566.0, ans=0.125 2023-06-28 09:54:15,656 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=2029686.0, ans=0.0 2023-06-28 09:54:28,769 INFO [train.py:996] (0/4) Epoch 12, batch 2850, loss[loss=0.1664, simple_loss=0.243, pruned_loss=0.04487, over 21677.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2924, pruned_loss=0.06868, over 4263808.76 frames. ], batch size: 247, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 09:54:54,196 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=2029806.0, ans=0.025 2023-06-28 09:55:04,179 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2029806.0, ans=0.0 2023-06-28 09:55:36,785 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.48 vs. 
limit=22.5 2023-06-28 09:55:54,604 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2029986.0, ans=0.125 2023-06-28 09:56:01,328 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=2029986.0, ans=0.125 2023-06-28 09:56:06,806 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.16 vs. limit=15.0 2023-06-28 09:56:12,486 INFO [train.py:996] (0/4) Epoch 12, batch 2900, loss[loss=0.1984, simple_loss=0.2903, pruned_loss=0.05328, over 21430.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2927, pruned_loss=0.06914, over 4267798.89 frames. ], batch size: 548, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 09:56:27,903 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.497e+02 8.368e+02 1.188e+03 2.037e+03 3.726e+03, threshold=2.377e+03, percent-clipped=4.0 2023-06-28 09:57:21,761 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=2030226.0, ans=0.2 2023-06-28 09:57:52,328 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=2030286.0, ans=0.0 2023-06-28 09:57:54,080 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2030286.0, ans=0.125 2023-06-28 09:57:56,780 INFO [train.py:996] (0/4) Epoch 12, batch 2950, loss[loss=0.2563, simple_loss=0.3212, pruned_loss=0.09569, over 21787.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2946, pruned_loss=0.0698, over 4272861.03 frames. ], batch size: 441, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 09:58:55,014 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2030466.0, ans=0.1 2023-06-28 09:59:40,597 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=2030646.0, ans=0.04949747468305833 2023-06-28 09:59:41,598 INFO [train.py:996] (0/4) Epoch 12, batch 3000, loss[loss=0.2396, simple_loss=0.3128, pruned_loss=0.08323, over 21505.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2989, pruned_loss=0.07026, over 4279111.44 frames. ], batch size: 194, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 09:59:41,599 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-28 10:00:03,546 INFO [train.py:1028] (0/4) Epoch 12, validation: loss=0.2539, simple_loss=0.3416, pruned_loss=0.08306, over 1796401.00 frames. 
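Note on the validation entry just above: the log reports loss, simple_loss and pruned_loss as per-frame averages over the whole dev set (1796401 frames). As a rough illustration only, and not icefall's actual compute_validation_loss code, the sketch below shows how such frame-weighted averages could be accumulated; the model signature, the (features, feature_lens) loader output, and the simple_scale weighting are all assumptions for the example.

    import torch

    @torch.no_grad()
    def validation_averages(model, dev_loader, device, simple_scale: float = 0.5):
        # Hypothetical model: returns summed (simple_loss, pruned_loss) per batch.
        # Hypothetical loader: yields (features, feature_lens) tensors.
        model.eval()
        tot_simple, tot_pruned, tot_frames = 0.0, 0.0, 0
        for feats, feat_lens in dev_loader:
            feats, feat_lens = feats.to(device), feat_lens.to(device)
            simple_loss, pruned_loss = model(feats, feat_lens)  # assumed signature
            tot_simple += simple_loss.item()
            tot_pruned += pruned_loss.item()
            tot_frames += int(feat_lens.sum())
        # Assumed combination of the two loss terms for the reported "loss" value.
        tot_loss = simple_scale * tot_simple + tot_pruned
        return {
            "loss": tot_loss / tot_frames,          # per-frame averages, as in the
            "simple_loss": tot_simple / tot_frames,  # "over N frames" log format
            "pruned_loss": tot_pruned / tot_frames,
            "frames": tot_frames,
        }

The per-frame normalization is what makes validation numbers comparable across runs with different dev-set bucketing; the weighting between simple and pruned terms here is only a stand-in for whatever scaling the training recipe actually applies.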
2023-06-28 10:00:03,547 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23714MB 2023-06-28 10:00:24,304 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.619e+02 8.195e+02 1.192e+03 1.732e+03 4.635e+03, threshold=2.384e+03, percent-clipped=12.0 2023-06-28 10:01:06,488 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2030766.0, ans=0.1 2023-06-28 10:01:09,474 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2030826.0, ans=0.0 2023-06-28 10:01:12,804 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2030826.0, ans=0.1 2023-06-28 10:01:42,907 INFO [train.py:996] (0/4) Epoch 12, batch 3050, loss[loss=0.238, simple_loss=0.3362, pruned_loss=0.06988, over 21342.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2995, pruned_loss=0.06904, over 4279003.68 frames. ], batch size: 548, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:02:03,403 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=2030946.0, ans=0.125 2023-06-28 10:02:05,068 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2030946.0, ans=0.0 2023-06-28 10:02:16,636 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2031006.0, ans=0.1 2023-06-28 10:02:45,312 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2031066.0, ans=0.1 2023-06-28 10:03:18,280 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2031186.0, ans=0.125 2023-06-28 10:03:37,790 INFO [train.py:996] (0/4) Epoch 12, batch 3100, loss[loss=0.2049, simple_loss=0.3072, pruned_loss=0.05132, over 21239.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2987, pruned_loss=0.06814, over 4278829.98 frames. ], batch size: 548, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:03:57,009 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.760e+02 7.796e+02 1.121e+03 1.860e+03 4.097e+03, threshold=2.242e+03, percent-clipped=9.0 2023-06-28 10:03:57,597 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2031306.0, ans=0.1 2023-06-28 10:04:08,107 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 10:04:25,983 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.96 vs. limit=15.0 2023-06-28 10:05:27,750 INFO [train.py:996] (0/4) Epoch 12, batch 3150, loss[loss=0.2369, simple_loss=0.3367, pruned_loss=0.06854, over 21662.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2994, pruned_loss=0.06801, over 4277336.53 frames. 
], batch size: 441, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:05:55,022 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=2031606.0, ans=0.125 2023-06-28 10:06:49,401 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2031786.0, ans=0.1 2023-06-28 10:07:01,644 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.52 vs. limit=15.0 2023-06-28 10:07:12,552 INFO [train.py:996] (0/4) Epoch 12, batch 3200, loss[loss=0.2219, simple_loss=0.3212, pruned_loss=0.06135, over 21227.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2987, pruned_loss=0.06796, over 4270593.58 frames. ], batch size: 548, lr: 2.46e-03, grad_scale: 32.0 2023-06-28 10:07:32,469 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.058e+02 7.728e+02 1.156e+03 1.759e+03 4.154e+03, threshold=2.311e+03, percent-clipped=17.0 2023-06-28 10:07:46,691 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2031906.0, ans=0.125 2023-06-28 10:08:50,845 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=2032086.0, ans=10.0 2023-06-28 10:09:00,237 INFO [train.py:996] (0/4) Epoch 12, batch 3250, loss[loss=0.2061, simple_loss=0.2809, pruned_loss=0.06565, over 21990.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2989, pruned_loss=0.06869, over 4273785.19 frames. ], batch size: 103, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:09:43,046 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2032266.0, ans=0.1 2023-06-28 10:10:04,702 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=2032326.0, ans=0.0 2023-06-28 10:10:11,805 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.61 vs. limit=15.0 2023-06-28 10:10:39,242 INFO [train.py:996] (0/4) Epoch 12, batch 3300, loss[loss=0.242, simple_loss=0.3188, pruned_loss=0.08263, over 21364.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2937, pruned_loss=0.06812, over 4264128.39 frames. ], batch size: 549, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:10:56,003 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.931e+02 8.073e+02 1.537e+03 2.186e+03 4.176e+03, threshold=3.073e+03, percent-clipped=21.0 2023-06-28 10:11:44,164 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=2032626.0, ans=0.125 2023-06-28 10:11:54,206 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=2032626.0, ans=0.5 2023-06-28 10:12:23,344 INFO [train.py:996] (0/4) Epoch 12, batch 3350, loss[loss=0.2005, simple_loss=0.2938, pruned_loss=0.05361, over 21729.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.2954, pruned_loss=0.06793, over 4272688.09 frames. 
], batch size: 247, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:12:29,237 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2032746.0, ans=0.1 2023-06-28 10:12:40,857 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2032806.0, ans=0.125 2023-06-28 10:12:46,357 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=2032806.0, ans=0.025 2023-06-28 10:13:11,587 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=2032866.0, ans=22.5 2023-06-28 10:13:53,812 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2032986.0, ans=0.2 2023-06-28 10:14:06,585 INFO [train.py:996] (0/4) Epoch 12, batch 3400, loss[loss=0.2249, simple_loss=0.2948, pruned_loss=0.07753, over 21373.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2947, pruned_loss=0.06792, over 4280605.45 frames. ], batch size: 471, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:14:07,172 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2033046.0, ans=0.125 2023-06-28 10:14:09,495 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.69 vs. limit=12.0 2023-06-28 10:14:28,083 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.882e+02 7.652e+02 1.057e+03 1.709e+03 3.627e+03, threshold=2.113e+03, percent-clipped=2.0 2023-06-28 10:15:11,751 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2033226.0, ans=0.125 2023-06-28 10:15:50,722 INFO [train.py:996] (0/4) Epoch 12, batch 3450, loss[loss=0.1835, simple_loss=0.2536, pruned_loss=0.0567, over 21147.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2897, pruned_loss=0.0667, over 4279578.40 frames. ], batch size: 608, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:16:08,431 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=2033346.0, ans=0.05 2023-06-28 10:16:14,276 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.77 vs. limit=15.0 2023-06-28 10:17:06,703 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2033526.0, ans=0.125 2023-06-28 10:17:13,561 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=2033526.0, ans=0.125 2023-06-28 10:17:32,751 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2033586.0, ans=0.125 2023-06-28 10:17:35,133 INFO [train.py:996] (0/4) Epoch 12, batch 3500, loss[loss=0.2515, simple_loss=0.3303, pruned_loss=0.08638, over 21775.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2999, pruned_loss=0.07068, over 4281783.30 frames. 
], batch size: 247, lr: 2.46e-03, grad_scale: 8.0 2023-06-28 10:18:03,137 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.129e+02 8.730e+02 1.318e+03 1.854e+03 3.895e+03, threshold=2.636e+03, percent-clipped=20.0 2023-06-28 10:18:29,757 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.31 vs. limit=6.0 2023-06-28 10:19:07,582 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=2033886.0, ans=0.95 2023-06-28 10:19:14,553 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2033886.0, ans=0.125 2023-06-28 10:19:23,775 INFO [train.py:996] (0/4) Epoch 12, batch 3550, loss[loss=0.2522, simple_loss=0.2934, pruned_loss=0.1055, over 21295.00 frames. ], tot_loss[loss=0.223, simple_loss=0.3017, pruned_loss=0.07211, over 4274040.94 frames. ], batch size: 507, lr: 2.46e-03, grad_scale: 8.0 2023-06-28 10:19:42,693 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=2033946.0, ans=0.2 2023-06-28 10:20:03,619 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2034066.0, ans=0.125 2023-06-28 10:20:16,592 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=2034066.0, ans=0.125 2023-06-28 10:20:30,287 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2034126.0, ans=0.04949747468305833 2023-06-28 10:20:31,889 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2034126.0, ans=0.125 2023-06-28 10:21:12,850 INFO [train.py:996] (0/4) Epoch 12, batch 3600, loss[loss=0.2202, simple_loss=0.2757, pruned_loss=0.08231, over 21223.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.296, pruned_loss=0.07104, over 4266550.93 frames. ], batch size: 471, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:21:17,119 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=2034246.0, ans=0.2 2023-06-28 10:21:30,745 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2034306.0, ans=0.0 2023-06-28 10:21:30,768 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=2034306.0, ans=0.125 2023-06-28 10:21:31,746 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.278e+02 7.988e+02 1.219e+03 1.896e+03 5.241e+03, threshold=2.438e+03, percent-clipped=11.0 2023-06-28 10:22:08,630 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.73 vs. limit=15.0 2023-06-28 10:22:18,500 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.79 vs. limit=15.0 2023-06-28 10:22:24,397 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2034486.0, ans=0.1 2023-06-28 10:22:51,687 INFO [train.py:996] (0/4) Epoch 12, batch 3650, loss[loss=0.225, simple_loss=0.2821, pruned_loss=0.08393, over 21504.00 frames. 
], tot_loss[loss=0.2185, simple_loss=0.295, pruned_loss=0.07099, over 4268238.02 frames. ], batch size: 441, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:23:43,708 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2034666.0, ans=0.125 2023-06-28 10:24:04,983 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2034726.0, ans=0.1 2023-06-28 10:24:15,239 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2034786.0, ans=0.125 2023-06-28 10:24:33,950 INFO [train.py:996] (0/4) Epoch 12, batch 3700, loss[loss=0.2185, simple_loss=0.3035, pruned_loss=0.06681, over 21625.00 frames. ], tot_loss[loss=0.217, simple_loss=0.294, pruned_loss=0.07004, over 4270762.74 frames. ], batch size: 230, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:24:51,028 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2034846.0, ans=0.1 2023-06-28 10:24:57,051 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.964e+02 7.431e+02 1.073e+03 1.535e+03 4.329e+03, threshold=2.147e+03, percent-clipped=8.0 2023-06-28 10:25:21,532 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.81 vs. limit=15.0 2023-06-28 10:25:51,395 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2035086.0, ans=0.0 2023-06-28 10:26:17,526 INFO [train.py:996] (0/4) Epoch 12, batch 3750, loss[loss=0.1758, simple_loss=0.2574, pruned_loss=0.04711, over 21755.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2935, pruned_loss=0.06985, over 4278899.90 frames. ], batch size: 298, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:26:50,702 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.67 vs. limit=15.0 2023-06-28 10:26:51,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=2035206.0, ans=0.0 2023-06-28 10:26:59,534 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.67 vs. limit=15.0 2023-06-28 10:27:06,240 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2035266.0, ans=0.0 2023-06-28 10:27:52,658 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.36 vs. limit=22.5 2023-06-28 10:27:57,847 INFO [train.py:996] (0/4) Epoch 12, batch 3800, loss[loss=0.2042, simple_loss=0.28, pruned_loss=0.0642, over 21630.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2915, pruned_loss=0.06852, over 4279350.80 frames. 
], batch size: 263, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:28:21,925 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.866e+02 7.287e+02 1.012e+03 1.468e+03 2.920e+03, threshold=2.024e+03, percent-clipped=9.0 2023-06-28 10:28:27,205 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_na.min_abs, batch_count=2035506.0, ans=0.02 2023-06-28 10:28:27,351 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=2035506.0, ans=0.2 2023-06-28 10:28:43,666 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2035566.0, ans=0.0 2023-06-28 10:28:56,870 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=2035626.0, ans=0.04949747468305833 2023-06-28 10:29:40,099 INFO [train.py:996] (0/4) Epoch 12, batch 3850, loss[loss=0.1796, simple_loss=0.2443, pruned_loss=0.0574, over 21320.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2903, pruned_loss=0.06887, over 4277761.71 frames. ], batch size: 144, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:30:10,974 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2035806.0, ans=0.125 2023-06-28 10:30:56,577 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.45 vs. limit=15.0 2023-06-28 10:31:23,418 INFO [train.py:996] (0/4) Epoch 12, batch 3900, loss[loss=0.1977, simple_loss=0.2711, pruned_loss=0.06213, over 21646.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2874, pruned_loss=0.06868, over 4277365.11 frames. ], batch size: 263, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:31:47,273 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.137e+02 7.075e+02 9.098e+02 1.343e+03 3.131e+03, threshold=1.820e+03, percent-clipped=11.0 2023-06-28 10:31:54,607 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2036106.0, ans=0.125 2023-06-28 10:32:10,578 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 10:32:52,765 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2036286.0, ans=0.125 2023-06-28 10:33:08,680 INFO [train.py:996] (0/4) Epoch 12, batch 3950, loss[loss=0.1616, simple_loss=0.2381, pruned_loss=0.04257, over 21160.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2884, pruned_loss=0.06835, over 4280988.10 frames. ], batch size: 143, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:33:55,313 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.85 vs. limit=15.0 2023-06-28 10:34:52,696 INFO [train.py:996] (0/4) Epoch 12, batch 4000, loss[loss=0.1801, simple_loss=0.242, pruned_loss=0.05906, over 21292.00 frames. ], tot_loss[loss=0.207, simple_loss=0.2825, pruned_loss=0.06572, over 4280776.56 frames. 
], batch size: 160, lr: 2.46e-03, grad_scale: 32.0 2023-06-28 10:35:16,142 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.215e+02 7.767e+02 1.100e+03 1.663e+03 3.671e+03, threshold=2.200e+03, percent-clipped=20.0 2023-06-28 10:35:26,057 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.44 vs. limit=22.5 2023-06-28 10:35:37,653 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2036766.0, ans=0.125 2023-06-28 10:36:19,370 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=2036886.0, ans=0.05 2023-06-28 10:36:29,262 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2036886.0, ans=0.0 2023-06-28 10:36:35,200 INFO [train.py:996] (0/4) Epoch 12, batch 4050, loss[loss=0.1942, simple_loss=0.2822, pruned_loss=0.05305, over 21819.00 frames. ], tot_loss[loss=0.2066, simple_loss=0.2839, pruned_loss=0.06458, over 4287793.35 frames. ], batch size: 351, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:36:51,335 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.50 vs. limit=12.0 2023-06-28 10:36:52,517 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2036946.0, ans=0.0 2023-06-28 10:37:18,604 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=2037066.0, ans=0.125 2023-06-28 10:38:18,407 INFO [train.py:996] (0/4) Epoch 12, batch 4100, loss[loss=0.2024, simple_loss=0.2865, pruned_loss=0.05913, over 21819.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2852, pruned_loss=0.06466, over 4294839.00 frames. ], batch size: 282, lr: 2.46e-03, grad_scale: 8.0 2023-06-28 10:38:45,580 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.840e+02 7.700e+02 1.227e+03 1.924e+03 4.359e+03, threshold=2.455e+03, percent-clipped=14.0 2023-06-28 10:38:48,122 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2037306.0, ans=0.1 2023-06-28 10:39:03,035 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2037366.0, ans=0.125 2023-06-28 10:40:06,902 INFO [train.py:996] (0/4) Epoch 12, batch 4150, loss[loss=0.1727, simple_loss=0.261, pruned_loss=0.04222, over 21611.00 frames. ], tot_loss[loss=0.2049, simple_loss=0.2853, pruned_loss=0.06228, over 4296001.02 frames. ], batch size: 263, lr: 2.46e-03, grad_scale: 8.0 2023-06-28 10:40:09,322 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=2037546.0, ans=0.0 2023-06-28 10:40:15,182 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.46 vs. limit=15.0 2023-06-28 10:40:17,840 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=2037546.0, ans=0.125 2023-06-28 10:40:55,883 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.64 vs. 
limit=12.0 2023-06-28 10:41:29,595 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=2037786.0, ans=0.05 2023-06-28 10:41:29,599 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2037786.0, ans=0.125 2023-06-28 10:41:52,364 INFO [train.py:996] (0/4) Epoch 12, batch 4200, loss[loss=0.1879, simple_loss=0.2703, pruned_loss=0.05274, over 21505.00 frames. ], tot_loss[loss=0.206, simple_loss=0.2876, pruned_loss=0.06221, over 4286366.59 frames. ], batch size: 195, lr: 2.46e-03, grad_scale: 8.0 2023-06-28 10:41:57,988 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=2037846.0, ans=0.125 2023-06-28 10:41:59,916 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=2037846.0, ans=0.0 2023-06-28 10:42:14,625 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.777e+02 8.293e+02 1.484e+03 2.185e+03 3.637e+03, threshold=2.967e+03, percent-clipped=18.0 2023-06-28 10:42:51,716 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2037966.0, ans=0.0 2023-06-28 10:43:22,659 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2038086.0, ans=0.125 2023-06-28 10:43:37,194 INFO [train.py:996] (0/4) Epoch 12, batch 4250, loss[loss=0.2521, simple_loss=0.3393, pruned_loss=0.08238, over 21587.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2927, pruned_loss=0.06351, over 4286956.26 frames. ], batch size: 414, lr: 2.46e-03, grad_scale: 8.0 2023-06-28 10:43:41,232 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=2038146.0, ans=0.0 2023-06-28 10:44:16,276 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=2038206.0, ans=0.09899494936611666 2023-06-28 10:45:24,217 INFO [train.py:996] (0/4) Epoch 12, batch 4300, loss[loss=0.2177, simple_loss=0.3063, pruned_loss=0.06457, over 21227.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2974, pruned_loss=0.06517, over 4287033.68 frames. ], batch size: 549, lr: 2.46e-03, grad_scale: 8.0 2023-06-28 10:45:39,971 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2038446.0, ans=0.2 2023-06-28 10:46:00,803 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.611e+02 9.355e+02 1.305e+03 1.983e+03 5.098e+03, threshold=2.609e+03, percent-clipped=8.0 2023-06-28 10:46:30,003 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=2038566.0, ans=0.025 2023-06-28 10:47:01,760 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.26 vs. limit=15.0 2023-06-28 10:47:12,519 INFO [train.py:996] (0/4) Epoch 12, batch 4350, loss[loss=0.1733, simple_loss=0.2423, pruned_loss=0.05215, over 21460.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2974, pruned_loss=0.06459, over 4280855.87 frames. ], batch size: 230, lr: 2.46e-03, grad_scale: 8.0 2023-06-28 10:47:18,770 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.14 vs. 
limit=15.0 2023-06-28 10:47:26,337 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2038746.0, ans=0.1 2023-06-28 10:47:48,446 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=2038806.0, ans=0.0 2023-06-28 10:48:03,477 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=2038866.0, ans=0.0 2023-06-28 10:48:24,335 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=2038926.0, ans=0.07 2023-06-28 10:48:25,725 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2038926.0, ans=0.0 2023-06-28 10:48:50,364 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=2038986.0, ans=0.125 2023-06-28 10:49:03,201 INFO [train.py:996] (0/4) Epoch 12, batch 4400, loss[loss=0.2047, simple_loss=0.2884, pruned_loss=0.06048, over 21270.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2935, pruned_loss=0.06424, over 4284967.40 frames. ], batch size: 176, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:49:35,014 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.000e+02 1.052e+03 1.456e+03 1.843e+03 4.869e+03, threshold=2.912e+03, percent-clipped=14.0 2023-06-28 10:49:39,816 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.59 vs. limit=15.0 2023-06-28 10:49:46,419 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=2039166.0, ans=0.0 2023-06-28 10:50:03,710 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=2039226.0, ans=0.125 2023-06-28 10:50:53,926 INFO [train.py:996] (0/4) Epoch 12, batch 4450, loss[loss=0.3086, simple_loss=0.3855, pruned_loss=0.1159, over 21513.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.3018, pruned_loss=0.06679, over 4283838.37 frames. ], batch size: 507, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:51:06,309 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2039346.0, ans=0.0 2023-06-28 10:51:09,520 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=2039406.0, ans=0.0 2023-06-28 10:51:34,798 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2039466.0, ans=0.0 2023-06-28 10:52:30,243 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2039586.0, ans=0.0 2023-06-28 10:52:38,112 INFO [train.py:996] (0/4) Epoch 12, batch 4500, loss[loss=0.2237, simple_loss=0.2912, pruned_loss=0.07808, over 21282.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.3024, pruned_loss=0.06868, over 4289490.54 frames. 
], batch size: 143, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:52:40,422 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=2039646.0, ans=0.0 2023-06-28 10:52:43,826 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=2039646.0, ans=0.07 2023-06-28 10:53:04,871 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.459e+02 9.304e+02 1.246e+03 2.301e+03 3.917e+03, threshold=2.492e+03, percent-clipped=11.0 2023-06-28 10:53:06,032 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.58 vs. limit=15.0 2023-06-28 10:53:10,895 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=2039706.0, ans=0.2 2023-06-28 10:53:17,607 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=2039766.0, ans=0.125 2023-06-28 10:54:28,126 INFO [train.py:996] (0/4) Epoch 12, batch 4550, loss[loss=0.2214, simple_loss=0.3073, pruned_loss=0.06773, over 21786.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.304, pruned_loss=0.06863, over 4289500.65 frames. ], batch size: 247, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:54:30,724 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=2039946.0, ans=0.2 2023-06-28 10:54:42,242 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-340000.pt 2023-06-28 10:55:11,281 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=2040066.0, ans=0.2 2023-06-28 10:56:07,906 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=2040186.0, ans=0.2 2023-06-28 10:56:14,153 INFO [train.py:996] (0/4) Epoch 12, batch 4600, loss[loss=0.1822, simple_loss=0.2656, pruned_loss=0.04939, over 21756.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.3053, pruned_loss=0.07018, over 4289695.24 frames. ], batch size: 247, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:56:14,792 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2040246.0, ans=0.125 2023-06-28 10:56:14,859 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=2040246.0, ans=0.0 2023-06-28 10:56:36,681 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.360e+02 7.467e+02 1.139e+03 1.677e+03 2.825e+03, threshold=2.277e+03, percent-clipped=5.0 2023-06-28 10:56:57,177 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 10:57:12,027 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=2040366.0, ans=0.125 2023-06-28 10:57:27,142 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2040426.0, ans=0.125 2023-06-28 10:57:58,191 INFO [train.py:996] (0/4) Epoch 12, batch 4650, loss[loss=0.166, simple_loss=0.247, pruned_loss=0.04251, over 21771.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2994, pruned_loss=0.06826, over 4288962.80 frames. 
], batch size: 371, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:58:05,368 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=2040546.0, ans=0.125 2023-06-28 10:59:20,434 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.30 vs. limit=22.5 2023-06-28 10:59:40,592 INFO [train.py:996] (0/4) Epoch 12, batch 4700, loss[loss=0.2029, simple_loss=0.2714, pruned_loss=0.06722, over 21803.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2914, pruned_loss=0.06672, over 4290429.06 frames. ], batch size: 118, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:59:50,040 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=2040846.0, ans=0.0 2023-06-28 10:59:54,083 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.34 vs. limit=22.5 2023-06-28 11:00:07,733 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.876e+02 7.682e+02 1.181e+03 1.934e+03 4.585e+03, threshold=2.362e+03, percent-clipped=15.0 2023-06-28 11:00:13,601 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=2040906.0, ans=0.95 2023-06-28 11:01:23,178 INFO [train.py:996] (0/4) Epoch 12, batch 4750, loss[loss=0.1762, simple_loss=0.2456, pruned_loss=0.05338, over 21674.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2883, pruned_loss=0.06662, over 4278889.64 frames. ], batch size: 264, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 11:02:09,355 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.58 vs. limit=22.5 2023-06-28 11:02:42,479 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.84 vs. limit=15.0 2023-06-28 11:03:05,686 INFO [train.py:996] (0/4) Epoch 12, batch 4800, loss[loss=0.1708, simple_loss=0.2345, pruned_loss=0.0535, over 21579.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2871, pruned_loss=0.0668, over 4278959.20 frames. ], batch size: 213, lr: 2.46e-03, grad_scale: 32.0 2023-06-28 11:03:32,402 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.216e+02 8.084e+02 1.278e+03 1.855e+03 4.015e+03, threshold=2.556e+03, percent-clipped=12.0 2023-06-28 11:03:50,958 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=2041566.0, ans=0.2 2023-06-28 11:04:18,574 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2041626.0, ans=0.0 2023-06-28 11:04:47,269 INFO [train.py:996] (0/4) Epoch 12, batch 4850, loss[loss=0.2304, simple_loss=0.2992, pruned_loss=0.08081, over 21717.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2858, pruned_loss=0.06603, over 4283846.38 frames. 
], batch size: 441, lr: 2.46e-03, grad_scale: 32.0 2023-06-28 11:04:51,119 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2041746.0, ans=0.1 2023-06-28 11:05:14,441 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2041806.0, ans=0.125 2023-06-28 11:05:34,009 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=2041866.0, ans=0.04949747468305833 2023-06-28 11:05:35,819 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2041866.0, ans=0.125 2023-06-28 11:05:56,436 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.29 vs. limit=22.5 2023-06-28 11:06:30,300 INFO [train.py:996] (0/4) Epoch 12, batch 4900, loss[loss=0.2178, simple_loss=0.2968, pruned_loss=0.06944, over 21329.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2866, pruned_loss=0.0664, over 4285224.13 frames. ], batch size: 176, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 11:06:58,460 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.185e+02 7.394e+02 1.193e+03 1.925e+03 4.019e+03, threshold=2.386e+03, percent-clipped=10.0 2023-06-28 11:07:00,929 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=2042106.0, ans=0.2 2023-06-28 11:07:42,722 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2042226.0, ans=0.0 2023-06-28 11:08:01,011 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 11:08:14,032 INFO [train.py:996] (0/4) Epoch 12, batch 4950, loss[loss=0.179, simple_loss=0.2886, pruned_loss=0.03475, over 21167.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2898, pruned_loss=0.06512, over 4282753.79 frames. ], batch size: 548, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:09:01,900 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=2042466.0, ans=0.2 2023-06-28 11:09:42,428 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2042586.0, ans=0.1 2023-06-28 11:09:54,802 INFO [train.py:996] (0/4) Epoch 12, batch 5000, loss[loss=0.1912, simple_loss=0.2668, pruned_loss=0.05778, over 20146.00 frames. ], tot_loss[loss=0.2061, simple_loss=0.288, pruned_loss=0.06212, over 4271329.75 frames. 
], batch size: 703, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:10:22,975 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.667e+02 7.098e+02 1.009e+03 1.573e+03 3.184e+03, threshold=2.017e+03, percent-clipped=11.0 2023-06-28 11:10:34,650 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=2042706.0, ans=0.05 2023-06-28 11:10:37,704 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=2042766.0, ans=0.035 2023-06-28 11:10:47,661 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=2042766.0, ans=0.2 2023-06-28 11:11:06,830 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=2042826.0, ans=0.2 2023-06-28 11:11:22,990 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2042886.0, ans=0.125 2023-06-28 11:11:35,585 INFO [train.py:996] (0/4) Epoch 12, batch 5050, loss[loss=0.2258, simple_loss=0.2932, pruned_loss=0.07923, over 21716.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.2889, pruned_loss=0.06381, over 4277272.09 frames. ], batch size: 473, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:11:41,033 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2042946.0, ans=0.0 2023-06-28 11:13:13,907 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.23 vs. limit=15.0 2023-06-28 11:13:17,778 INFO [train.py:996] (0/4) Epoch 12, batch 5100, loss[loss=0.1898, simple_loss=0.2641, pruned_loss=0.05776, over 21805.00 frames. ], tot_loss[loss=0.2074, simple_loss=0.2869, pruned_loss=0.06393, over 4287515.51 frames. ], batch size: 102, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:13:23,917 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.28 vs. limit=15.0 2023-06-28 11:13:44,954 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.63 vs. limit=15.0 2023-06-28 11:13:45,416 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.047e+02 7.915e+02 1.019e+03 1.431e+03 3.420e+03, threshold=2.039e+03, percent-clipped=6.0 2023-06-28 11:14:14,119 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=2043366.0, ans=0.125 2023-06-28 11:14:15,896 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=2043426.0, ans=0.0 2023-06-28 11:14:34,047 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2043426.0, ans=0.125 2023-06-28 11:14:39,249 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2043486.0, ans=0.0 2023-06-28 11:15:00,448 INFO [train.py:996] (0/4) Epoch 12, batch 5150, loss[loss=0.2103, simple_loss=0.2882, pruned_loss=0.06618, over 17166.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2862, pruned_loss=0.065, over 4289729.25 frames. 
], batch size: 60, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:16:38,442 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=2043786.0, ans=0.015 2023-06-28 11:16:44,546 INFO [train.py:996] (0/4) Epoch 12, batch 5200, loss[loss=0.2076, simple_loss=0.3096, pruned_loss=0.0528, over 21644.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2871, pruned_loss=0.06539, over 4288167.14 frames. ], batch size: 263, lr: 2.45e-03, grad_scale: 32.0 2023-06-28 11:16:50,171 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=2043846.0, ans=0.0 2023-06-28 11:16:51,957 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2043846.0, ans=0.125 2023-06-28 11:17:18,856 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.436e+02 7.444e+02 1.331e+03 2.729e+03 6.291e+03, threshold=2.663e+03, percent-clipped=30.0 2023-06-28 11:17:19,621 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2043906.0, ans=0.125 2023-06-28 11:18:07,230 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2044026.0, ans=0.125 2023-06-28 11:18:17,214 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2044086.0, ans=0.125 2023-06-28 11:18:26,595 INFO [train.py:996] (0/4) Epoch 12, batch 5250, loss[loss=0.2373, simple_loss=0.319, pruned_loss=0.07779, over 21697.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2911, pruned_loss=0.06463, over 4290004.61 frames. ], batch size: 414, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:19:11,472 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.96 vs. limit=15.0 2023-06-28 11:20:08,092 INFO [train.py:996] (0/4) Epoch 12, batch 5300, loss[loss=0.2634, simple_loss=0.3151, pruned_loss=0.1058, over 21796.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2902, pruned_loss=0.06531, over 4298231.32 frames. 
], batch size: 508, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:20:11,946 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=2044446.0, ans=0.125 2023-06-28 11:20:19,439 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=2044446.0, ans=0.125 2023-06-28 11:20:21,410 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=2044446.0, ans=0.0 2023-06-28 11:20:42,500 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.313e+02 7.509e+02 1.039e+03 1.571e+03 3.451e+03, threshold=2.078e+03, percent-clipped=7.0 2023-06-28 11:20:51,051 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=2044566.0, ans=0.125 2023-06-28 11:21:20,480 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2044626.0, ans=0.125 2023-06-28 11:21:26,661 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2044626.0, ans=0.0 2023-06-28 11:21:31,662 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=2044686.0, ans=0.0 2023-06-28 11:21:45,823 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=2044686.0, ans=0.0 2023-06-28 11:21:48,603 INFO [train.py:996] (0/4) Epoch 12, batch 5350, loss[loss=0.2281, simple_loss=0.3074, pruned_loss=0.07436, over 21722.00 frames. ], tot_loss[loss=0.211, simple_loss=0.289, pruned_loss=0.06652, over 4308318.42 frames. ], batch size: 112, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:22:03,746 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=2044746.0, ans=0.125 2023-06-28 11:22:05,585 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2044746.0, ans=0.125 2023-06-28 11:22:26,310 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2044806.0, ans=0.125 2023-06-28 11:22:51,178 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=2044926.0, ans=0.0 2023-06-28 11:23:24,966 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=2044986.0, ans=0.5 2023-06-28 11:23:35,439 INFO [train.py:996] (0/4) Epoch 12, batch 5400, loss[loss=0.1938, simple_loss=0.2671, pruned_loss=0.06031, over 21875.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2869, pruned_loss=0.0679, over 4310846.76 frames. ], batch size: 332, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:23:48,704 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.68 vs. 
limit=15.0 2023-06-28 11:24:03,411 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=2045106.0, ans=0.2 2023-06-28 11:24:05,989 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.403e+02 8.319e+02 1.196e+03 1.782e+03 3.222e+03, threshold=2.392e+03, percent-clipped=18.0 2023-06-28 11:24:10,295 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2045106.0, ans=0.1 2023-06-28 11:24:25,548 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=2045166.0, ans=0.0 2023-06-28 11:24:50,629 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.01 vs. limit=10.0 2023-06-28 11:24:55,119 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2045226.0, ans=0.125 2023-06-28 11:25:19,471 INFO [train.py:996] (0/4) Epoch 12, batch 5450, loss[loss=0.1961, simple_loss=0.286, pruned_loss=0.05307, over 21796.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2885, pruned_loss=0.06647, over 4307182.01 frames. ], batch size: 124, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:25:26,228 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2045346.0, ans=0.125 2023-06-28 11:25:31,576 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=2045346.0, ans=0.0 2023-06-28 11:25:34,899 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=2045346.0, ans=0.035 2023-06-28 11:26:11,023 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=2045466.0, ans=0.125 2023-06-28 11:26:17,650 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2045466.0, ans=0.125 2023-06-28 11:26:40,148 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.71 vs. limit=15.0 2023-06-28 11:26:57,815 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2045586.0, ans=0.125 2023-06-28 11:27:08,762 INFO [train.py:996] (0/4) Epoch 12, batch 5500, loss[loss=0.183, simple_loss=0.272, pruned_loss=0.04702, over 21422.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2941, pruned_loss=0.06348, over 4301904.41 frames. ], batch size: 211, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:27:44,011 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.944e+02 8.580e+02 1.207e+03 1.863e+03 4.637e+03, threshold=2.413e+03, percent-clipped=15.0 2023-06-28 11:27:46,694 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.40 vs. 
limit=15.0 2023-06-28 11:27:49,462 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2045706.0, ans=0.0 2023-06-28 11:28:38,572 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2045886.0, ans=0.125 2023-06-28 11:28:57,734 INFO [train.py:996] (0/4) Epoch 12, batch 5550, loss[loss=0.1847, simple_loss=0.2962, pruned_loss=0.03665, over 21163.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2958, pruned_loss=0.0613, over 4296428.28 frames. ], batch size: 548, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:29:06,013 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.60 vs. limit=15.0 2023-06-28 11:30:30,213 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2046186.0, ans=0.125 2023-06-28 11:30:46,193 INFO [train.py:996] (0/4) Epoch 12, batch 5600, loss[loss=0.2321, simple_loss=0.3388, pruned_loss=0.0627, over 21774.00 frames. ], tot_loss[loss=0.206, simple_loss=0.2938, pruned_loss=0.05906, over 4290851.72 frames. ], batch size: 351, lr: 2.45e-03, grad_scale: 32.0 2023-06-28 11:30:58,873 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2046246.0, ans=0.125 2023-06-28 11:31:13,173 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.841e+02 9.110e+02 1.414e+03 2.313e+03 5.859e+03, threshold=2.829e+03, percent-clipped=23.0 2023-06-28 11:31:24,617 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 11:31:41,955 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=2046366.0, ans=0.05 2023-06-28 11:31:45,809 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.10 vs. limit=22.5 2023-06-28 11:32:17,955 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2046486.0, ans=0.125 2023-06-28 11:32:27,080 INFO [train.py:996] (0/4) Epoch 12, batch 5650, loss[loss=0.2189, simple_loss=0.2977, pruned_loss=0.07006, over 21721.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2976, pruned_loss=0.06156, over 4288335.95 frames. ], batch size: 389, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:32:52,840 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2046606.0, ans=0.125 2023-06-28 11:33:02,633 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2046606.0, ans=0.1 2023-06-28 11:33:26,787 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=2046726.0, ans=0.125 2023-06-28 11:33:35,351 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2046726.0, ans=0.125 2023-06-28 11:33:39,383 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.68 vs. 
limit=6.0 2023-06-28 11:33:47,055 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=2046786.0, ans=0.2 2023-06-28 11:33:50,843 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.36 vs. limit=15.0 2023-06-28 11:34:09,873 INFO [train.py:996] (0/4) Epoch 12, batch 5700, loss[loss=0.2599, simple_loss=0.3393, pruned_loss=0.0903, over 21499.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2961, pruned_loss=0.06323, over 4292569.05 frames. ], batch size: 508, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:34:34,824 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.56 vs. limit=10.0 2023-06-28 11:34:42,203 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.668e+02 8.634e+02 1.270e+03 1.811e+03 3.578e+03, threshold=2.540e+03, percent-clipped=6.0 2023-06-28 11:35:08,566 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2046966.0, ans=0.0 2023-06-28 11:35:14,835 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=2047026.0, ans=0.0 2023-06-28 11:35:18,515 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.68 vs. limit=12.0 2023-06-28 11:35:54,494 INFO [train.py:996] (0/4) Epoch 12, batch 5750, loss[loss=0.2258, simple_loss=0.335, pruned_loss=0.05833, over 19779.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.295, pruned_loss=0.06043, over 4280999.12 frames. ], batch size: 702, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:37:25,469 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2047386.0, ans=0.0 2023-06-28 11:37:27,378 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=2047386.0, ans=0.0 2023-06-28 11:37:43,053 INFO [train.py:996] (0/4) Epoch 12, batch 5800, loss[loss=0.2275, simple_loss=0.3274, pruned_loss=0.06378, over 21592.00 frames. ], tot_loss[loss=0.2047, simple_loss=0.2919, pruned_loss=0.0587, over 4274831.09 frames. ], batch size: 441, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:38:03,278 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=2047506.0, ans=0.04949747468305833 2023-06-28 11:38:14,541 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.621e+02 6.881e+02 1.222e+03 1.758e+03 3.677e+03, threshold=2.444e+03, percent-clipped=11.0 2023-06-28 11:38:34,523 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2047566.0, ans=0.0 2023-06-28 11:38:43,493 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2047566.0, ans=0.125 2023-06-28 11:38:47,603 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.55 vs. 
limit=10.0 2023-06-28 11:38:48,523 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2047626.0, ans=0.0 2023-06-28 11:39:31,921 INFO [train.py:996] (0/4) Epoch 12, batch 5850, loss[loss=0.2411, simple_loss=0.3297, pruned_loss=0.07623, over 20111.00 frames. ], tot_loss[loss=0.2034, simple_loss=0.2932, pruned_loss=0.0568, over 4263816.74 frames. ], batch size: 702, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:39:34,100 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff3.min_abs, batch_count=2047746.0, ans=0.2 2023-06-28 11:39:41,532 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.22 vs. limit=22.5 2023-06-28 11:39:43,093 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.26 vs. limit=22.5 2023-06-28 11:40:09,070 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=2047806.0, ans=0.0 2023-06-28 11:40:53,083 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2047986.0, ans=0.125 2023-06-28 11:40:54,600 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2047986.0, ans=0.125 2023-06-28 11:41:08,981 INFO [train.py:996] (0/4) Epoch 12, batch 5900, loss[loss=0.1855, simple_loss=0.2661, pruned_loss=0.0525, over 21306.00 frames. ], tot_loss[loss=0.1946, simple_loss=0.2859, pruned_loss=0.0517, over 4263721.97 frames. ], batch size: 159, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:41:09,578 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2048046.0, ans=0.0 2023-06-28 11:41:18,830 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2048046.0, ans=0.125 2023-06-28 11:41:25,580 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=2048046.0, ans=0.125 2023-06-28 11:41:27,949 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.13 vs. limit=15.0 2023-06-28 11:41:44,142 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.278e+02 9.930e+02 1.759e+03 2.367e+03 3.954e+03, threshold=3.519e+03, percent-clipped=21.0 2023-06-28 11:42:54,192 INFO [train.py:996] (0/4) Epoch 12, batch 5950, loss[loss=0.1972, simple_loss=0.2652, pruned_loss=0.06457, over 21919.00 frames. ], tot_loss[loss=0.196, simple_loss=0.2841, pruned_loss=0.05395, over 4263285.69 frames. 
], batch size: 373, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:43:03,102 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=2048346.0, ans=0.125 2023-06-28 11:43:12,990 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2048346.0, ans=0.125 2023-06-28 11:43:14,767 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2048406.0, ans=0.125 2023-06-28 11:43:16,789 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.51 vs. limit=15.0 2023-06-28 11:43:31,706 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.63 vs. limit=22.5 2023-06-28 11:44:09,947 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.50 vs. limit=12.0 2023-06-28 11:44:36,737 INFO [train.py:996] (0/4) Epoch 12, batch 6000, loss[loss=0.1984, simple_loss=0.2625, pruned_loss=0.06708, over 21267.00 frames. ], tot_loss[loss=0.1975, simple_loss=0.2814, pruned_loss=0.05681, over 4257183.76 frames. ], batch size: 159, lr: 2.45e-03, grad_scale: 32.0 2023-06-28 11:44:36,739 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-28 11:44:56,183 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.5.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([2.3042, 2.1087, 3.5045, 3.2814], device='cuda:0') 2023-06-28 11:44:57,243 INFO [train.py:1028] (0/4) Epoch 12, validation: loss=0.2597, simple_loss=0.3509, pruned_loss=0.08424, over 1796401.00 frames. 2023-06-28 11:44:57,244 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23714MB 2023-06-28 11:44:59,752 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2048646.0, ans=0.125 2023-06-28 11:45:28,553 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.837e+02 9.369e+02 1.291e+03 2.028e+03 3.757e+03, threshold=2.582e+03, percent-clipped=1.0 2023-06-28 11:45:32,945 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=2048706.0, ans=0.125 2023-06-28 11:46:15,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2048826.0, ans=0.125 2023-06-28 11:46:24,235 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=2048886.0, ans=0.125 2023-06-28 11:46:40,059 INFO [train.py:996] (0/4) Epoch 12, batch 6050, loss[loss=0.1698, simple_loss=0.2495, pruned_loss=0.04505, over 21607.00 frames. ], tot_loss[loss=0.1957, simple_loss=0.2757, pruned_loss=0.05779, over 4247203.25 frames. ], batch size: 415, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:47:04,980 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.33 vs. 
limit=12.0 2023-06-28 11:47:07,904 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2049006.0, ans=0.125 2023-06-28 11:47:28,882 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=2049066.0, ans=0.0 2023-06-28 11:47:54,930 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.01 vs. limit=22.5 2023-06-28 11:47:58,044 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.28 vs. limit=12.0 2023-06-28 11:48:02,785 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2049186.0, ans=0.125 2023-06-28 11:48:28,627 INFO [train.py:996] (0/4) Epoch 12, batch 6100, loss[loss=0.2035, simple_loss=0.2857, pruned_loss=0.06064, over 21871.00 frames. ], tot_loss[loss=0.194, simple_loss=0.2748, pruned_loss=0.0566, over 4260240.83 frames. ], batch size: 332, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:48:31,101 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=2049246.0, ans=0.0 2023-06-28 11:48:47,621 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2049306.0, ans=0.1 2023-06-28 11:48:57,066 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.586e+02 8.433e+02 1.328e+03 2.179e+03 5.742e+03, threshold=2.657e+03, percent-clipped=17.0 2023-06-28 11:49:21,579 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2049366.0, ans=0.0 2023-06-28 11:50:13,357 INFO [train.py:996] (0/4) Epoch 12, batch 6150, loss[loss=0.2036, simple_loss=0.2746, pruned_loss=0.06633, over 21261.00 frames. ], tot_loss[loss=0.1968, simple_loss=0.2761, pruned_loss=0.05878, over 4265769.48 frames. ], batch size: 159, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:50:47,597 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=2049666.0, ans=0.0 2023-06-28 11:50:47,740 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=2049666.0, ans=0.125 2023-06-28 11:50:58,808 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.75 vs. limit=15.0 2023-06-28 11:51:56,279 INFO [train.py:996] (0/4) Epoch 12, batch 6200, loss[loss=0.32, simple_loss=0.3978, pruned_loss=0.1211, over 21652.00 frames. ], tot_loss[loss=0.1986, simple_loss=0.279, pruned_loss=0.05911, over 4265754.38 frames. 
], batch size: 509, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:51:58,517 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2049846.0, ans=0.0 2023-06-28 11:52:14,797 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2049906.0, ans=0.0 2023-06-28 11:52:32,345 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.502e+02 7.829e+02 1.153e+03 1.728e+03 4.252e+03, threshold=2.307e+03, percent-clipped=8.0 2023-06-28 11:52:39,876 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 11:52:57,933 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2050026.0, ans=0.125 2023-06-28 11:53:16,866 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2050026.0, ans=0.1 2023-06-28 11:53:19,217 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.30 vs. limit=22.5 2023-06-28 11:53:41,406 INFO [train.py:996] (0/4) Epoch 12, batch 6250, loss[loss=0.2187, simple_loss=0.3247, pruned_loss=0.05636, over 21674.00 frames. ], tot_loss[loss=0.202, simple_loss=0.2855, pruned_loss=0.05925, over 4274791.09 frames. ], batch size: 441, lr: 2.45e-03, grad_scale: 8.0 2023-06-28 11:53:52,279 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2050146.0, ans=0.125 2023-06-28 11:54:05,228 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2050206.0, ans=0.125 2023-06-28 11:54:18,767 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=2050266.0, ans=0.0 2023-06-28 11:54:37,474 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.99 vs. limit=15.0 2023-06-28 11:55:16,077 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 11:55:23,838 INFO [train.py:996] (0/4) Epoch 12, batch 6300, loss[loss=0.2005, simple_loss=0.3227, pruned_loss=0.0391, over 20784.00 frames. ], tot_loss[loss=0.2024, simple_loss=0.2882, pruned_loss=0.05826, over 4270439.79 frames. ], batch size: 608, lr: 2.45e-03, grad_scale: 8.0 2023-06-28 11:55:31,246 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2050446.0, ans=0.125 2023-06-28 11:56:03,346 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.650e+02 7.162e+02 1.070e+03 1.625e+03 2.845e+03, threshold=2.140e+03, percent-clipped=5.0 2023-06-28 11:56:44,577 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=2050626.0, ans=0.125 2023-06-28 11:57:05,266 INFO [train.py:996] (0/4) Epoch 12, batch 6350, loss[loss=0.2492, simple_loss=0.3277, pruned_loss=0.08533, over 21423.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.2905, pruned_loss=0.06152, over 4272320.71 frames. 
], batch size: 131, lr: 2.45e-03, grad_scale: 8.0 2023-06-28 11:57:27,340 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 11:57:47,279 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=2050866.0, ans=0.125 2023-06-28 11:57:59,652 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.36 vs. limit=12.0 2023-06-28 11:58:54,047 INFO [train.py:996] (0/4) Epoch 12, batch 6400, loss[loss=0.2424, simple_loss=0.3201, pruned_loss=0.08237, over 21458.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.296, pruned_loss=0.06556, over 4271449.85 frames. ], batch size: 211, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:59:04,749 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2051046.0, ans=0.0 2023-06-28 11:59:27,062 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=2051106.0, ans=0.07 2023-06-28 11:59:29,781 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.784e+02 8.222e+02 1.150e+03 1.542e+03 3.199e+03, threshold=2.299e+03, percent-clipped=10.0 2023-06-28 11:59:51,704 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2051166.0, ans=0.0 2023-06-28 12:00:22,858 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=2051286.0, ans=0.0 2023-06-28 12:00:36,734 INFO [train.py:996] (0/4) Epoch 12, batch 6450, loss[loss=0.2196, simple_loss=0.3017, pruned_loss=0.06874, over 21827.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2974, pruned_loss=0.06519, over 4275870.91 frames. ], batch size: 102, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:00:53,463 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 12:01:49,077 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2051526.0, ans=0.1 2023-06-28 12:02:06,482 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=2051586.0, ans=10.0 2023-06-28 12:02:20,336 INFO [train.py:996] (0/4) Epoch 12, batch 6500, loss[loss=0.2234, simple_loss=0.2876, pruned_loss=0.07964, over 21836.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.2903, pruned_loss=0.06441, over 4279996.09 frames. ], batch size: 98, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:02:35,718 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2051646.0, ans=0.125 2023-06-28 12:02:59,807 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.006e+02 7.341e+02 1.379e+03 1.907e+03 4.704e+03, threshold=2.757e+03, percent-clipped=17.0 2023-06-28 12:03:40,462 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2051826.0, ans=0.0 2023-06-28 12:04:03,581 INFO [train.py:996] (0/4) Epoch 12, batch 6550, loss[loss=0.2019, simple_loss=0.28, pruned_loss=0.06185, over 21733.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.2884, pruned_loss=0.06373, over 4279631.62 frames. 
], batch size: 247, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:04:23,477 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2052006.0, ans=0.125 2023-06-28 12:04:33,164 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2052006.0, ans=0.2 2023-06-28 12:04:57,213 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2052066.0, ans=0.0 2023-06-28 12:04:57,336 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2052066.0, ans=0.0 2023-06-28 12:05:05,578 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 12:05:08,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2052126.0, ans=0.125 2023-06-28 12:05:35,231 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 12:05:44,397 INFO [train.py:996] (0/4) Epoch 12, batch 6600, loss[loss=0.1877, simple_loss=0.2552, pruned_loss=0.06006, over 21745.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.2836, pruned_loss=0.06369, over 4280686.33 frames. ], batch size: 300, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:06:28,656 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.752e+02 7.717e+02 1.174e+03 1.589e+03 2.955e+03, threshold=2.349e+03, percent-clipped=1.0 2023-06-28 12:06:40,911 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=2052366.0, ans=0.125 2023-06-28 12:06:43,456 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.68 vs. limit=15.0 2023-06-28 12:06:44,371 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=2052366.0, ans=0.5 2023-06-28 12:06:49,309 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2052426.0, ans=0.0 2023-06-28 12:07:12,568 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=2052486.0, ans=0.2 2023-06-28 12:07:32,096 INFO [train.py:996] (0/4) Epoch 12, batch 6650, loss[loss=0.1749, simple_loss=0.252, pruned_loss=0.0489, over 21671.00 frames. ], tot_loss[loss=0.2003, simple_loss=0.2779, pruned_loss=0.0613, over 4270709.21 frames. ], batch size: 298, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:07:46,621 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.70 vs. limit=22.5 2023-06-28 12:07:57,809 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=2052606.0, ans=22.5 2023-06-28 12:08:20,568 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.50 vs. limit=15.0 2023-06-28 12:09:13,056 INFO [train.py:996] (0/4) Epoch 12, batch 6700, loss[loss=0.2306, simple_loss=0.2958, pruned_loss=0.08267, over 21554.00 frames. ], tot_loss[loss=0.1979, simple_loss=0.2743, pruned_loss=0.06074, over 4268425.01 frames. 
], batch size: 442, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:09:52,367 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.639e+02 7.163e+02 1.028e+03 1.473e+03 3.561e+03, threshold=2.056e+03, percent-clipped=9.0 2023-06-28 12:09:54,747 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2052966.0, ans=0.125 2023-06-28 12:10:51,363 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=2053086.0, ans=0.125 2023-06-28 12:10:53,892 INFO [train.py:996] (0/4) Epoch 12, batch 6750, loss[loss=0.1785, simple_loss=0.2513, pruned_loss=0.05289, over 21455.00 frames. ], tot_loss[loss=0.198, simple_loss=0.2733, pruned_loss=0.06138, over 4274897.69 frames. ], batch size: 212, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:11:00,891 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=2053146.0, ans=0.0 2023-06-28 12:11:38,596 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.89 vs. limit=12.0 2023-06-28 12:11:41,308 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2053266.0, ans=0.125 2023-06-28 12:12:01,516 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=2053326.0, ans=6.0 2023-06-28 12:12:33,680 INFO [train.py:996] (0/4) Epoch 12, batch 6800, loss[loss=0.2022, simple_loss=0.2756, pruned_loss=0.06444, over 21474.00 frames. ], tot_loss[loss=0.2012, simple_loss=0.2765, pruned_loss=0.06295, over 4276667.51 frames. ], batch size: 212, lr: 2.45e-03, grad_scale: 32.0 2023-06-28 12:12:56,858 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=2053506.0, ans=0.2 2023-06-28 12:13:13,852 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.991e+02 6.929e+02 1.207e+03 2.029e+03 5.012e+03, threshold=2.414e+03, percent-clipped=24.0 2023-06-28 12:13:21,191 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=2053566.0, ans=0.125 2023-06-28 12:13:32,542 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=2053626.0, ans=0.0 2023-06-28 12:13:53,899 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2053686.0, ans=0.125 2023-06-28 12:14:02,762 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.54 vs. limit=6.0 2023-06-28 12:14:13,234 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 12:14:14,441 INFO [train.py:996] (0/4) Epoch 12, batch 6850, loss[loss=0.2249, simple_loss=0.2825, pruned_loss=0.08368, over 21722.00 frames. ], tot_loss[loss=0.2012, simple_loss=0.2743, pruned_loss=0.06405, over 4272403.99 frames. 
], batch size: 414, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:14:16,501 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=2053746.0, ans=0.125 2023-06-28 12:15:15,674 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=2053926.0, ans=0.05 2023-06-28 12:15:19,463 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=2053926.0, ans=0.0 2023-06-28 12:15:58,204 INFO [train.py:996] (0/4) Epoch 12, batch 6900, loss[loss=0.2013, simple_loss=0.2723, pruned_loss=0.06516, over 21340.00 frames. ], tot_loss[loss=0.2021, simple_loss=0.2749, pruned_loss=0.06466, over 4280141.20 frames. ], batch size: 159, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:16:39,822 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.832e+02 6.265e+02 8.270e+02 1.384e+03 3.220e+03, threshold=1.654e+03, percent-clipped=7.0 2023-06-28 12:17:23,848 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2054286.0, ans=0.1 2023-06-28 12:17:25,392 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=2054286.0, ans=0.125 2023-06-28 12:17:45,886 INFO [train.py:996] (0/4) Epoch 12, batch 6950, loss[loss=0.2614, simple_loss=0.3271, pruned_loss=0.0978, over 21434.00 frames. ], tot_loss[loss=0.2002, simple_loss=0.277, pruned_loss=0.06174, over 4281202.74 frames. ], batch size: 471, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:17:53,170 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=2054346.0, ans=0.125 2023-06-28 12:18:01,038 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=2054346.0, ans=0.0 2023-06-28 12:19:28,440 INFO [train.py:996] (0/4) Epoch 12, batch 7000, loss[loss=0.205, simple_loss=0.2779, pruned_loss=0.06609, over 21725.00 frames. ], tot_loss[loss=0.2044, simple_loss=0.2799, pruned_loss=0.06444, over 4282095.49 frames. ], batch size: 351, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:19:51,550 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.80 vs. limit=15.0 2023-06-28 12:20:05,478 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.352e+02 8.399e+02 1.085e+03 1.441e+03 2.628e+03, threshold=2.170e+03, percent-clipped=15.0 2023-06-28 12:20:23,677 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2054766.0, ans=0.125 2023-06-28 12:21:03,468 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=2054886.0, ans=0.2 2023-06-28 12:21:08,466 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=2054886.0, ans=0.0 2023-06-28 12:21:16,085 INFO [train.py:996] (0/4) Epoch 12, batch 7050, loss[loss=0.1723, simple_loss=0.2605, pruned_loss=0.04205, over 21748.00 frames. ], tot_loss[loss=0.2008, simple_loss=0.2765, pruned_loss=0.06251, over 4278135.45 frames. 
], batch size: 247, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:22:38,464 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2055186.0, ans=0.125 2023-06-28 12:23:00,216 INFO [train.py:996] (0/4) Epoch 12, batch 7100, loss[loss=0.1604, simple_loss=0.2386, pruned_loss=0.04107, over 16348.00 frames. ], tot_loss[loss=0.2066, simple_loss=0.2836, pruned_loss=0.06484, over 4271429.15 frames. ], batch size: 61, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:23:36,452 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.448e+02 7.505e+02 1.150e+03 1.796e+03 3.717e+03, threshold=2.300e+03, percent-clipped=14.0 2023-06-28 12:24:20,195 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2055486.0, ans=0.125 2023-06-28 12:24:42,351 INFO [train.py:996] (0/4) Epoch 12, batch 7150, loss[loss=0.2645, simple_loss=0.3295, pruned_loss=0.09977, over 21381.00 frames. ], tot_loss[loss=0.202, simple_loss=0.2798, pruned_loss=0.06207, over 4262971.88 frames. ], batch size: 471, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:24:46,422 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2055546.0, ans=0.125 2023-06-28 12:24:53,062 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=2055546.0, ans=0.125 2023-06-28 12:25:04,385 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=2055606.0, ans=0.015 2023-06-28 12:25:06,235 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=2055606.0, ans=0.0 2023-06-28 12:25:52,892 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=2055726.0, ans=0.0 2023-06-28 12:26:02,808 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=2055726.0, ans=0.0 2023-06-28 12:26:12,492 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2055786.0, ans=0.125 2023-06-28 12:26:25,297 INFO [train.py:996] (0/4) Epoch 12, batch 7200, loss[loss=0.2339, simple_loss=0.31, pruned_loss=0.07888, over 21875.00 frames. ], tot_loss[loss=0.2052, simple_loss=0.2824, pruned_loss=0.06404, over 4263135.05 frames. ], batch size: 107, lr: 2.45e-03, grad_scale: 32.0 2023-06-28 12:26:30,183 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.41 vs. limit=12.0 2023-06-28 12:26:40,613 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2055846.0, ans=0.125 2023-06-28 12:26:52,689 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 12:27:08,261 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.252e+02 8.659e+02 1.185e+03 1.756e+03 3.819e+03, threshold=2.369e+03, percent-clipped=13.0 2023-06-28 12:27:29,390 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.26 vs. 
limit=15.0 2023-06-28 12:27:33,558 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2056026.0, ans=0.0 2023-06-28 12:28:12,249 INFO [train.py:996] (0/4) Epoch 12, batch 7250, loss[loss=0.2248, simple_loss=0.2815, pruned_loss=0.08407, over 21488.00 frames. ], tot_loss[loss=0.2043, simple_loss=0.2794, pruned_loss=0.0646, over 4267585.49 frames. ], batch size: 441, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:28:50,948 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2056266.0, ans=0.125 2023-06-28 12:29:52,475 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2056446.0, ans=0.125 2023-06-28 12:29:53,560 INFO [train.py:996] (0/4) Epoch 12, batch 7300, loss[loss=0.2054, simple_loss=0.2561, pruned_loss=0.07731, over 21305.00 frames. ], tot_loss[loss=0.2005, simple_loss=0.2733, pruned_loss=0.06383, over 4265143.31 frames. ], batch size: 473, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:30:07,117 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2056446.0, ans=0.1 2023-06-28 12:30:31,643 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.740e+02 7.948e+02 1.183e+03 1.586e+03 3.750e+03, threshold=2.367e+03, percent-clipped=12.0 2023-06-28 12:30:52,127 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2056626.0, ans=0.125 2023-06-28 12:31:07,688 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2056626.0, ans=0.125 2023-06-28 12:31:27,506 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=2056686.0, ans=0.0 2023-06-28 12:31:31,972 INFO [train.py:996] (0/4) Epoch 12, batch 7350, loss[loss=0.2006, simple_loss=0.2744, pruned_loss=0.06335, over 21790.00 frames. ], tot_loss[loss=0.1994, simple_loss=0.2709, pruned_loss=0.06399, over 4260288.94 frames. ], batch size: 247, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:31:58,100 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2056806.0, ans=0.0 2023-06-28 12:32:37,390 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=2056926.0, ans=0.0 2023-06-28 12:32:46,406 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.23 vs. limit=15.0 2023-06-28 12:32:50,158 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.08 vs. 
limit=15.0 2023-06-28 12:32:51,471 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2056926.0, ans=0.09899494936611666 2023-06-28 12:33:08,300 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2056986.0, ans=0.125 2023-06-28 12:33:09,880 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=2056986.0, ans=0.2 2023-06-28 12:33:17,330 INFO [train.py:996] (0/4) Epoch 12, batch 7400, loss[loss=0.1922, simple_loss=0.2864, pruned_loss=0.04903, over 21732.00 frames. ], tot_loss[loss=0.2045, simple_loss=0.2783, pruned_loss=0.06537, over 4261702.52 frames. ], batch size: 332, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:33:34,470 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2057046.0, ans=0.125 2023-06-28 12:34:05,872 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.109e+02 7.290e+02 9.953e+02 1.415e+03 2.956e+03, threshold=1.991e+03, percent-clipped=1.0 2023-06-28 12:35:00,571 INFO [train.py:996] (0/4) Epoch 12, batch 7450, loss[loss=0.1958, simple_loss=0.2662, pruned_loss=0.06274, over 21656.00 frames. ], tot_loss[loss=0.2027, simple_loss=0.2781, pruned_loss=0.06365, over 4262014.91 frames. ], batch size: 282, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:35:01,312 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2057346.0, ans=0.1 2023-06-28 12:36:08,812 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.19 vs. limit=22.5 2023-06-28 12:36:49,962 INFO [train.py:996] (0/4) Epoch 12, batch 7500, loss[loss=0.2572, simple_loss=0.3614, pruned_loss=0.07654, over 21753.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2837, pruned_loss=0.06541, over 4266987.24 frames. ], batch size: 351, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:37:33,889 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.996e+02 7.365e+02 1.053e+03 1.699e+03 4.084e+03, threshold=2.105e+03, percent-clipped=21.0 2023-06-28 12:38:13,481 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=2057886.0, ans=0.125 2023-06-28 12:38:34,123 INFO [train.py:996] (0/4) Epoch 12, batch 7550, loss[loss=0.1864, simple_loss=0.2776, pruned_loss=0.04763, over 21694.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2911, pruned_loss=0.06449, over 4271957.00 frames. ], batch size: 247, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:39:08,801 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.70 vs. limit=10.0 2023-06-28 12:39:41,285 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2058126.0, ans=0.125 2023-06-28 12:39:55,287 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.35 vs. 
limit=10.0 2023-06-28 12:39:56,032 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=2058186.0, ans=0.125 2023-06-28 12:40:16,306 INFO [train.py:996] (0/4) Epoch 12, batch 7600, loss[loss=0.2017, simple_loss=0.2916, pruned_loss=0.05585, over 21685.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2893, pruned_loss=0.0629, over 4268706.45 frames. ], batch size: 389, lr: 2.45e-03, grad_scale: 32.0 2023-06-28 12:40:58,887 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.988e+02 7.986e+02 1.163e+03 1.762e+03 3.955e+03, threshold=2.326e+03, percent-clipped=12.0 2023-06-28 12:41:22,069 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2058426.0, ans=0.125 2023-06-28 12:41:27,934 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.58 vs. limit=22.5 2023-06-28 12:41:35,612 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 12:41:37,109 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=2058486.0, ans=0.0 2023-06-28 12:41:40,587 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=2058486.0, ans=10.0 2023-06-28 12:41:57,889 INFO [train.py:996] (0/4) Epoch 12, batch 7650, loss[loss=0.1651, simple_loss=0.2659, pruned_loss=0.03217, over 20754.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2875, pruned_loss=0.06436, over 4278315.33 frames. ], batch size: 609, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:42:55,372 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=2058666.0, ans=0.125 2023-06-28 12:42:59,028 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2058666.0, ans=0.125 2023-06-28 12:43:19,321 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2058786.0, ans=0.125 2023-06-28 12:43:45,448 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2058846.0, ans=0.125 2023-06-28 12:43:46,552 INFO [train.py:996] (0/4) Epoch 12, batch 7700, loss[loss=0.2206, simple_loss=0.2943, pruned_loss=0.07343, over 21832.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2904, pruned_loss=0.06725, over 4281376.07 frames. ], batch size: 282, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:44:16,230 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=2058906.0, ans=0.07 2023-06-28 12:44:16,247 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=2058906.0, ans=0.0 2023-06-28 12:44:31,977 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.981e+02 7.414e+02 1.157e+03 1.590e+03 5.387e+03, threshold=2.314e+03, percent-clipped=8.0 2023-06-28 12:44:43,672 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.39 vs. limit=12.0 2023-06-28 12:45:36,625 INFO [train.py:996] (0/4) Epoch 12, batch 7750, loss[loss=0.2898, simple_loss=0.4052, pruned_loss=0.08716, over 21228.00 frames. 
], tot_loss[loss=0.2173, simple_loss=0.298, pruned_loss=0.06836, over 4280862.38 frames. ], batch size: 549, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 12:45:38,867 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=2059146.0, ans=0.125 2023-06-28 12:45:38,951 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2059146.0, ans=0.125 2023-06-28 12:45:40,551 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=2059146.0, ans=0.125 2023-06-28 12:46:53,930 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2059326.0, ans=0.0 2023-06-28 12:47:16,258 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.02 vs. limit=22.5 2023-06-28 12:47:21,161 INFO [train.py:996] (0/4) Epoch 12, batch 7800, loss[loss=0.1983, simple_loss=0.2743, pruned_loss=0.06121, over 21589.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2995, pruned_loss=0.069, over 4274826.62 frames. ], batch size: 263, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 12:47:26,481 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2059446.0, ans=0.0 2023-06-28 12:47:26,511 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2059446.0, ans=0.1 2023-06-28 12:47:29,999 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=2059446.0, ans=0.125 2023-06-28 12:48:00,037 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.542e+02 9.199e+02 1.440e+03 2.477e+03 5.669e+03, threshold=2.881e+03, percent-clipped=30.0 2023-06-28 12:48:41,027 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 12:48:42,425 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2059686.0, ans=0.125 2023-06-28 12:49:03,636 INFO [train.py:996] (0/4) Epoch 12, batch 7850, loss[loss=0.2075, simple_loss=0.3073, pruned_loss=0.05389, over 20819.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2921, pruned_loss=0.06771, over 4266198.04 frames. ], batch size: 609, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 12:50:16,815 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=2059926.0, ans=0.2 2023-06-28 12:50:30,314 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 12:50:48,150 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2060046.0, ans=0.0 2023-06-28 12:50:49,207 INFO [train.py:996] (0/4) Epoch 12, batch 7900, loss[loss=0.3308, simple_loss=0.4154, pruned_loss=0.1231, over 21413.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2883, pruned_loss=0.06656, over 4256536.98 frames. ], batch size: 507, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 12:51:26,668 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.66 vs. 
limit=6.0 2023-06-28 12:51:30,569 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.553e+02 9.216e+02 1.431e+03 2.035e+03 3.808e+03, threshold=2.862e+03, percent-clipped=8.0 2023-06-28 12:51:34,653 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=2060166.0, ans=10.0 2023-06-28 12:52:10,706 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2060226.0, ans=0.125 2023-06-28 12:52:38,424 INFO [train.py:996] (0/4) Epoch 12, batch 7950, loss[loss=0.1908, simple_loss=0.2757, pruned_loss=0.05293, over 21477.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2924, pruned_loss=0.06607, over 4261888.33 frames. ], batch size: 211, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 12:52:54,542 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2060406.0, ans=0.0 2023-06-28 12:53:02,946 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=2060406.0, ans=0.125 2023-06-28 12:53:22,546 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2060466.0, ans=0.1 2023-06-28 12:53:34,846 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.45 vs. limit=10.0 2023-06-28 12:53:59,378 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2060526.0, ans=0.125 2023-06-28 12:54:24,587 INFO [train.py:996] (0/4) Epoch 12, batch 8000, loss[loss=0.2212, simple_loss=0.3378, pruned_loss=0.05228, over 19916.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2973, pruned_loss=0.06763, over 4262690.42 frames. ], batch size: 702, lr: 2.44e-03, grad_scale: 32.0 2023-06-28 12:54:27,147 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2060646.0, ans=0.0 2023-06-28 12:55:18,764 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.842e+02 9.882e+02 1.672e+03 2.798e+03 5.114e+03, threshold=3.344e+03, percent-clipped=23.0 2023-06-28 12:56:13,505 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 12:56:16,329 INFO [train.py:996] (0/4) Epoch 12, batch 8050, loss[loss=0.2043, simple_loss=0.284, pruned_loss=0.06228, over 20059.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.301, pruned_loss=0.06826, over 4262353.57 frames. ], batch size: 702, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 12:56:28,452 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2060946.0, ans=0.125 2023-06-28 12:56:47,498 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.54 vs. limit=22.5 2023-06-28 12:57:01,073 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.26 vs. 
limit=15.0 2023-06-28 12:57:28,800 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=2061126.0, ans=0.2 2023-06-28 12:57:30,941 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.11 vs. limit=15.0 2023-06-28 12:57:51,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2061186.0, ans=0.125 2023-06-28 12:58:04,717 INFO [train.py:996] (0/4) Epoch 12, batch 8100, loss[loss=0.2168, simple_loss=0.2953, pruned_loss=0.06919, over 21893.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2999, pruned_loss=0.06908, over 4271302.08 frames. ], batch size: 371, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 12:58:05,441 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=2061246.0, ans=0.125 2023-06-28 12:58:53,297 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.830e+02 7.832e+02 1.202e+03 2.450e+03 5.574e+03, threshold=2.405e+03, percent-clipped=10.0 2023-06-28 12:59:03,415 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.85 vs. limit=15.0 2023-06-28 12:59:22,072 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=2061426.0, ans=0.0 2023-06-28 12:59:56,682 INFO [train.py:996] (0/4) Epoch 12, batch 8150, loss[loss=0.1992, simple_loss=0.2904, pruned_loss=0.05397, over 21468.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.3077, pruned_loss=0.07029, over 4273343.91 frames. ], batch size: 212, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:00:36,186 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.45 vs. limit=15.0 2023-06-28 13:00:57,100 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2061726.0, ans=0.125 2023-06-28 13:01:07,311 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=2061726.0, ans=0.125 2023-06-28 13:01:39,549 INFO [train.py:996] (0/4) Epoch 12, batch 8200, loss[loss=0.186, simple_loss=0.2328, pruned_loss=0.06961, over 20189.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2986, pruned_loss=0.06838, over 4268841.44 frames. ], batch size: 703, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:02:05,671 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.59 vs. limit=22.5 2023-06-28 13:02:21,472 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.733e+02 7.541e+02 1.166e+03 1.975e+03 4.840e+03, threshold=2.333e+03, percent-clipped=18.0 2023-06-28 13:03:07,595 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.18 vs. limit=22.5 2023-06-28 13:03:07,747 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.07 vs. limit=6.0 2023-06-28 13:03:23,733 INFO [train.py:996] (0/4) Epoch 12, batch 8250, loss[loss=0.2263, simple_loss=0.3254, pruned_loss=0.06355, over 21716.00 frames. 
], tot_loss[loss=0.2154, simple_loss=0.2953, pruned_loss=0.06772, over 4260795.59 frames. ], batch size: 298, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:04:02,839 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2062266.0, ans=0.1 2023-06-28 13:04:14,185 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.17 vs. limit=15.0 2023-06-28 13:04:58,552 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.37 vs. limit=15.0 2023-06-28 13:05:07,863 INFO [train.py:996] (0/4) Epoch 12, batch 8300, loss[loss=0.2415, simple_loss=0.3236, pruned_loss=0.0797, over 21516.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2937, pruned_loss=0.06478, over 4266052.04 frames. ], batch size: 471, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:05:49,523 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.960e+02 7.792e+02 1.211e+03 1.944e+03 6.178e+03, threshold=2.421e+03, percent-clipped=18.0 2023-06-28 13:06:10,173 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.72 vs. limit=15.0 2023-06-28 13:06:13,452 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 13:06:26,506 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2062626.0, ans=0.1 2023-06-28 13:06:41,332 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2062686.0, ans=0.125 2023-06-28 13:06:55,854 INFO [train.py:996] (0/4) Epoch 12, batch 8350, loss[loss=0.1808, simple_loss=0.2659, pruned_loss=0.04791, over 21698.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2914, pruned_loss=0.06237, over 4269863.32 frames. ], batch size: 282, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:07:03,695 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff3.min_abs, batch_count=2062746.0, ans=0.2 2023-06-28 13:07:09,228 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.24 vs. limit=15.0 2023-06-28 13:08:18,878 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=2062986.0, ans=0.125 2023-06-28 13:08:39,735 INFO [train.py:996] (0/4) Epoch 12, batch 8400, loss[loss=0.2567, simple_loss=0.3787, pruned_loss=0.06734, over 20755.00 frames. ], tot_loss[loss=0.2066, simple_loss=0.2907, pruned_loss=0.06122, over 4270664.82 frames. 
], batch size: 607, lr: 2.44e-03, grad_scale: 32.0 2023-06-28 13:08:58,402 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 13:09:07,978 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2063106.0, ans=0.1 2023-06-28 13:09:21,787 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.051e+02 6.739e+02 1.036e+03 1.500e+03 3.619e+03, threshold=2.071e+03, percent-clipped=10.0 2023-06-28 13:09:58,876 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 13:10:04,601 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.43 vs. limit=15.0 2023-06-28 13:10:07,240 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff2.min_abs, batch_count=2063286.0, ans=0.1 2023-06-28 13:10:20,167 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=2063346.0, ans=0.0 2023-06-28 13:10:21,261 INFO [train.py:996] (0/4) Epoch 12, batch 8450, loss[loss=0.2015, simple_loss=0.2739, pruned_loss=0.06458, over 21828.00 frames. ], tot_loss[loss=0.2049, simple_loss=0.2883, pruned_loss=0.06071, over 4282010.32 frames. ], batch size: 298, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:10:22,166 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2063346.0, ans=0.1 2023-06-28 13:10:30,832 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.77 vs. limit=12.0 2023-06-28 13:10:33,381 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2063346.0, ans=0.125 2023-06-28 13:10:41,326 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 13:11:01,757 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=2063466.0, ans=0.0 2023-06-28 13:12:04,203 INFO [train.py:996] (0/4) Epoch 12, batch 8500, loss[loss=0.1869, simple_loss=0.2533, pruned_loss=0.06023, over 21698.00 frames. ], tot_loss[loss=0.2038, simple_loss=0.284, pruned_loss=0.06182, over 4272971.21 frames. ], batch size: 264, lr: 2.44e-03, grad_scale: 8.0 2023-06-28 13:12:42,453 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.97 vs. 
limit=15.0 2023-06-28 13:12:47,168 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2063766.0, ans=0.0 2023-06-28 13:12:47,212 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2063766.0, ans=0.1 2023-06-28 13:12:47,214 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=2063766.0, ans=0.2 2023-06-28 13:12:49,780 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.731e+02 8.144e+02 1.139e+03 1.907e+03 5.140e+03, threshold=2.279e+03, percent-clipped=18.0 2023-06-28 13:13:00,170 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2063766.0, ans=0.2 2023-06-28 13:13:01,767 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=2063766.0, ans=0.2 2023-06-28 13:13:16,874 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=2063826.0, ans=0.0 2023-06-28 13:13:48,463 INFO [train.py:996] (0/4) Epoch 12, batch 8550, loss[loss=0.1956, simple_loss=0.2777, pruned_loss=0.05677, over 21778.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.2889, pruned_loss=0.06411, over 4266507.76 frames. ], batch size: 118, lr: 2.44e-03, grad_scale: 8.0 2023-06-28 13:14:02,532 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-344000.pt 2023-06-28 13:14:55,708 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=2064126.0, ans=0.2 2023-06-28 13:15:10,525 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2064126.0, ans=0.125 2023-06-28 13:15:34,962 INFO [train.py:996] (0/4) Epoch 12, batch 8600, loss[loss=0.2152, simple_loss=0.3, pruned_loss=0.06523, over 21743.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2944, pruned_loss=0.06607, over 4264291.86 frames. ], batch size: 332, lr: 2.44e-03, grad_scale: 8.0 2023-06-28 13:15:40,851 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=2064246.0, ans=0.125 2023-06-28 13:15:40,894 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=2064246.0, ans=0.125 2023-06-28 13:15:41,567 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.72 vs. limit=22.5 2023-06-28 13:16:14,004 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=2064306.0, ans=0.125 2023-06-28 13:16:29,847 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.562e+02 1.076e+03 1.611e+03 2.403e+03 4.318e+03, threshold=3.223e+03, percent-clipped=30.0 2023-06-28 13:17:18,551 INFO [train.py:996] (0/4) Epoch 12, batch 8650, loss[loss=0.1758, simple_loss=0.2844, pruned_loss=0.0336, over 21778.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.3003, pruned_loss=0.06772, over 4263992.59 frames. 
], batch size: 332, lr: 2.44e-03, grad_scale: 8.0 2023-06-28 13:18:55,408 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2064786.0, ans=0.125 2023-06-28 13:18:57,742 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.93 vs. limit=15.0 2023-06-28 13:18:59,821 INFO [train.py:996] (0/4) Epoch 12, batch 8700, loss[loss=0.2165, simple_loss=0.2695, pruned_loss=0.08172, over 21221.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.292, pruned_loss=0.06408, over 4266921.49 frames. ], batch size: 471, lr: 2.44e-03, grad_scale: 8.0 2023-06-28 13:19:53,201 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.81 vs. limit=15.0 2023-06-28 13:19:53,462 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.697e+02 7.863e+02 1.211e+03 1.985e+03 4.359e+03, threshold=2.422e+03, percent-clipped=4.0 2023-06-28 13:19:54,042 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2064966.0, ans=0.0 2023-06-28 13:19:55,771 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2064966.0, ans=0.125 2023-06-28 13:19:59,144 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2064966.0, ans=0.125 2023-06-28 13:20:00,358 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=2064966.0, ans=0.125 2023-06-28 13:20:02,231 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2065026.0, ans=0.2 2023-06-28 13:20:41,881 INFO [train.py:996] (0/4) Epoch 12, batch 8750, loss[loss=0.235, simple_loss=0.312, pruned_loss=0.07897, over 21899.00 frames. ], tot_loss[loss=0.2091, simple_loss=0.2887, pruned_loss=0.06477, over 4269553.65 frames. ], batch size: 107, lr: 2.44e-03, grad_scale: 8.0 2023-06-28 13:21:10,117 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2065206.0, ans=0.125 2023-06-28 13:21:25,222 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=2065266.0, ans=0.0 2023-06-28 13:22:31,074 INFO [train.py:996] (0/4) Epoch 12, batch 8800, loss[loss=0.2586, simple_loss=0.3392, pruned_loss=0.08897, over 21286.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2977, pruned_loss=0.06727, over 4271538.40 frames. ], batch size: 548, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:23:26,840 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.165e+02 8.763e+02 1.222e+03 1.735e+03 3.559e+03, threshold=2.444e+03, percent-clipped=10.0 2023-06-28 13:23:34,548 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.88 vs. 
limit=15.0 2023-06-28 13:23:35,947 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2065626.0, ans=0.0 2023-06-28 13:23:40,654 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2065626.0, ans=0.0 2023-06-28 13:23:56,720 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.14 vs. limit=15.0 2023-06-28 13:24:16,137 INFO [train.py:996] (0/4) Epoch 12, batch 8850, loss[loss=0.2533, simple_loss=0.3091, pruned_loss=0.09878, over 21391.00 frames. ], tot_loss[loss=0.22, simple_loss=0.303, pruned_loss=0.06844, over 4275263.72 frames. ], batch size: 508, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:26:05,235 INFO [train.py:996] (0/4) Epoch 12, batch 8900, loss[loss=0.184, simple_loss=0.2601, pruned_loss=0.05394, over 21542.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2974, pruned_loss=0.0675, over 4274334.88 frames. ], batch size: 230, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:26:16,524 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2066046.0, ans=0.1 2023-06-28 13:26:47,871 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.32 vs. limit=15.0 2023-06-28 13:26:57,492 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.266e+02 7.347e+02 1.235e+03 1.790e+03 4.739e+03, threshold=2.470e+03, percent-clipped=10.0 2023-06-28 13:27:56,311 INFO [train.py:996] (0/4) Epoch 12, batch 8950, loss[loss=0.2154, simple_loss=0.3028, pruned_loss=0.06399, over 21857.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2993, pruned_loss=0.06686, over 4276748.62 frames. ], batch size: 317, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:28:12,617 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=21.11 vs. limit=22.5 2023-06-28 13:28:17,498 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=2066406.0, ans=15.0 2023-06-28 13:28:33,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2066466.0, ans=0.125 2023-06-28 13:28:46,872 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2066466.0, ans=0.2 2023-06-28 13:28:58,785 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.11 vs. limit=15.0 2023-06-28 13:29:38,963 INFO [train.py:996] (0/4) Epoch 12, batch 9000, loss[loss=0.1947, simple_loss=0.2749, pruned_loss=0.05722, over 21579.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2935, pruned_loss=0.06731, over 4277220.38 frames. ], batch size: 298, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:29:38,964 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-28 13:29:59,527 INFO [train.py:1028] (0/4) Epoch 12, validation: loss=0.2628, simple_loss=0.3535, pruned_loss=0.086, over 1796401.00 frames. 
2023-06-28 13:29:59,529 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23714MB 2023-06-28 13:30:12,027 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2066646.0, ans=0.125 2023-06-28 13:30:17,307 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2066706.0, ans=0.1 2023-06-28 13:30:44,982 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.661e+02 7.055e+02 9.403e+02 1.588e+03 4.919e+03, threshold=1.881e+03, percent-clipped=11.0 2023-06-28 13:31:09,426 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=2066826.0, ans=0.2 2023-06-28 13:31:44,375 INFO [train.py:996] (0/4) Epoch 12, batch 9050, loss[loss=0.2083, simple_loss=0.2893, pruned_loss=0.06367, over 21351.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.2883, pruned_loss=0.06456, over 4276840.58 frames. ], batch size: 549, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:31:46,767 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2066946.0, ans=0.125 2023-06-28 13:31:56,032 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=19.49 vs. limit=15.0 2023-06-28 13:32:04,699 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.09 vs. limit=15.0 2023-06-28 13:32:30,109 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2067066.0, ans=0.1 2023-06-28 13:33:27,910 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2067186.0, ans=0.1 2023-06-28 13:33:30,617 INFO [train.py:996] (0/4) Epoch 12, batch 9100, loss[loss=0.2132, simple_loss=0.3134, pruned_loss=0.05653, over 21618.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2925, pruned_loss=0.06666, over 4275926.84 frames. ], batch size: 389, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:33:42,025 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2067246.0, ans=0.1 2023-06-28 13:34:22,065 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.166e+02 1.280e+03 2.185e+03 3.198e+03 4.785e+03, threshold=4.371e+03, percent-clipped=55.0 2023-06-28 13:34:36,288 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2067426.0, ans=0.125 2023-06-28 13:34:53,093 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=2067426.0, ans=0.0 2023-06-28 13:35:06,725 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2067486.0, ans=0.0 2023-06-28 13:35:16,214 INFO [train.py:996] (0/4) Epoch 12, batch 9150, loss[loss=0.2064, simple_loss=0.2916, pruned_loss=0.06059, over 21325.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2958, pruned_loss=0.06476, over 4275329.46 frames. 
], batch size: 159, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:36:34,071 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.93 vs. limit=15.0 2023-06-28 13:36:54,118 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=8.30 vs. limit=15.0 2023-06-28 13:36:59,437 INFO [train.py:996] (0/4) Epoch 12, batch 9200, loss[loss=0.25, simple_loss=0.3334, pruned_loss=0.08327, over 21802.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2981, pruned_loss=0.06453, over 4272608.47 frames. ], batch size: 124, lr: 2.44e-03, grad_scale: 32.0 2023-06-28 13:37:46,209 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.17 vs. limit=15.0 2023-06-28 13:38:01,133 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.761e+02 9.017e+02 1.569e+03 2.101e+03 3.767e+03, threshold=3.138e+03, percent-clipped=0.0 2023-06-28 13:38:48,666 INFO [train.py:996] (0/4) Epoch 12, batch 9250, loss[loss=0.2422, simple_loss=0.3109, pruned_loss=0.08677, over 21325.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2999, pruned_loss=0.06693, over 4277060.29 frames. ], batch size: 548, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:39:08,141 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=2068146.0, ans=0.2 2023-06-28 13:39:16,939 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=2068206.0, ans=0.125 2023-06-28 13:39:17,055 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2068206.0, ans=0.125 2023-06-28 13:39:47,654 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.75 vs. limit=15.0 2023-06-28 13:39:54,799 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.26 vs. limit=15.0 2023-06-28 13:39:58,317 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.95 vs. limit=10.0 2023-06-28 13:40:20,253 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 13:40:39,789 INFO [train.py:996] (0/4) Epoch 12, batch 9300, loss[loss=0.2089, simple_loss=0.2867, pruned_loss=0.06558, over 21520.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2946, pruned_loss=0.06635, over 4271710.42 frames. ], batch size: 195, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:41:08,242 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2068506.0, ans=0.125 2023-06-28 13:41:25,033 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.88 vs. limit=12.0 2023-06-28 13:41:32,630 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.166e+02 1.033e+03 1.685e+03 2.661e+03 5.053e+03, threshold=3.371e+03, percent-clipped=15.0 2023-06-28 13:42:25,450 INFO [train.py:996] (0/4) Epoch 12, batch 9350, loss[loss=0.2377, simple_loss=0.3212, pruned_loss=0.07704, over 21260.00 frames. 
], tot_loss[loss=0.2177, simple_loss=0.3006, pruned_loss=0.06737, over 4268957.88 frames. ], batch size: 143, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:42:33,284 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2068746.0, ans=0.0 2023-06-28 13:42:43,198 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2068746.0, ans=0.125 2023-06-28 13:42:45,068 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2068746.0, ans=0.2 2023-06-28 13:42:45,710 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.45 vs. limit=15.0 2023-06-28 13:42:52,920 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2068806.0, ans=0.1 2023-06-28 13:42:54,825 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2068806.0, ans=0.0 2023-06-28 13:42:54,844 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2068806.0, ans=0.125 2023-06-28 13:43:00,027 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2068806.0, ans=0.1 2023-06-28 13:43:11,764 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 13:43:36,497 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.75 vs. limit=10.0 2023-06-28 13:43:56,152 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=2068986.0, ans=0.0 2023-06-28 13:44:04,504 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2068986.0, ans=0.0 2023-06-28 13:44:15,564 INFO [train.py:996] (0/4) Epoch 12, batch 9400, loss[loss=0.2003, simple_loss=0.2781, pruned_loss=0.06127, over 21508.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.3024, pruned_loss=0.06843, over 4259107.08 frames. ], batch size: 389, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:44:34,102 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 13:45:01,444 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.886e+02 7.931e+02 1.125e+03 1.716e+03 3.605e+03, threshold=2.249e+03, percent-clipped=1.0 2023-06-28 13:45:58,272 INFO [train.py:996] (0/4) Epoch 12, batch 9450, loss[loss=0.2665, simple_loss=0.3882, pruned_loss=0.07238, over 19713.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2965, pruned_loss=0.06702, over 4262559.29 frames. ], batch size: 702, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:46:17,435 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2069346.0, ans=0.125 2023-06-28 13:46:29,424 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.69 vs. 
limit=15.0 2023-06-28 13:46:33,786 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=2069406.0, ans=0.2 2023-06-28 13:47:12,163 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.07 vs. limit=15.0 2023-06-28 13:47:15,132 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=2069526.0, ans=0.0 2023-06-28 13:47:41,520 INFO [train.py:996] (0/4) Epoch 12, batch 9500, loss[loss=0.1912, simple_loss=0.2588, pruned_loss=0.06184, over 21137.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2892, pruned_loss=0.06582, over 4260634.53 frames. ], batch size: 159, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:48:06,207 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.54 vs. limit=15.0 2023-06-28 13:48:07,084 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2069706.0, ans=0.0 2023-06-28 13:48:38,315 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.464e+02 8.117e+02 1.177e+03 1.570e+03 4.123e+03, threshold=2.354e+03, percent-clipped=16.0 2023-06-28 13:49:05,673 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2069886.0, ans=0.1 2023-06-28 13:49:15,972 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=2069886.0, ans=0.2 2023-06-28 13:49:25,115 INFO [train.py:996] (0/4) Epoch 12, batch 9550, loss[loss=0.225, simple_loss=0.3118, pruned_loss=0.06913, over 21774.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2893, pruned_loss=0.0661, over 4264645.09 frames. ], batch size: 247, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:49:33,922 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=2069946.0, ans=0.0 2023-06-28 13:50:02,908 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2070066.0, ans=0.0 2023-06-28 13:50:02,978 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2070066.0, ans=0.1 2023-06-28 13:50:40,490 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.87 vs. limit=12.0 2023-06-28 13:51:04,174 INFO [train.py:996] (0/4) Epoch 12, batch 9600, loss[loss=0.1801, simple_loss=0.2577, pruned_loss=0.05122, over 21783.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2946, pruned_loss=0.06802, over 4271366.94 frames. 
], batch size: 282, lr: 2.44e-03, grad_scale: 32.0 2023-06-28 13:51:32,527 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2070306.0, ans=0.1 2023-06-28 13:52:01,559 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.070e+02 8.059e+02 1.139e+03 1.979e+03 4.989e+03, threshold=2.277e+03, percent-clipped=18.0 2023-06-28 13:52:18,975 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2070426.0, ans=0.125 2023-06-28 13:52:40,240 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.18 vs. limit=15.0 2023-06-28 13:52:41,274 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2070486.0, ans=0.125 2023-06-28 13:52:52,034 INFO [train.py:996] (0/4) Epoch 12, batch 9650, loss[loss=0.2332, simple_loss=0.3114, pruned_loss=0.0775, over 21707.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2944, pruned_loss=0.06768, over 4272864.21 frames. ], batch size: 351, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:53:00,987 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=2070546.0, ans=0.015 2023-06-28 13:53:03,211 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.73 vs. limit=15.0 2023-06-28 13:53:12,453 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=2070606.0, ans=0.125 2023-06-28 13:53:14,584 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2070606.0, ans=0.125 2023-06-28 13:53:51,080 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2070726.0, ans=0.125 2023-06-28 13:54:32,058 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=2070786.0, ans=0.125 2023-06-28 13:54:36,712 INFO [train.py:996] (0/4) Epoch 12, batch 9700, loss[loss=0.244, simple_loss=0.3166, pruned_loss=0.08565, over 21395.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2974, pruned_loss=0.06799, over 4277006.18 frames. ], batch size: 548, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:55:26,190 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.10 vs. limit=22.5 2023-06-28 13:55:29,991 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.817e+02 8.034e+02 1.157e+03 1.856e+03 3.207e+03, threshold=2.314e+03, percent-clipped=13.0 2023-06-28 13:55:35,741 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2071026.0, ans=0.0 2023-06-28 13:55:51,226 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.58 vs. limit=15.0 2023-06-28 13:56:18,674 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.13 vs. 
limit=15.0 2023-06-28 13:56:19,108 INFO [train.py:996] (0/4) Epoch 12, batch 9750, loss[loss=0.1828, simple_loss=0.2492, pruned_loss=0.05824, over 21577.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2927, pruned_loss=0.06707, over 4273972.91 frames. ], batch size: 247, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:56:19,565 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2071146.0, ans=0.125 2023-06-28 13:56:41,229 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=2071206.0, ans=0.125 2023-06-28 13:57:03,410 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.11 vs. limit=15.0 2023-06-28 13:57:04,440 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2071266.0, ans=0.1 2023-06-28 13:57:06,705 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.46 vs. limit=6.0 2023-06-28 13:57:19,376 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=2071326.0, ans=0.2 2023-06-28 13:57:24,454 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2071326.0, ans=0.125 2023-06-28 13:57:41,232 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.71 vs. limit=22.5 2023-06-28 13:58:01,398 INFO [train.py:996] (0/4) Epoch 12, batch 9800, loss[loss=0.2147, simple_loss=0.2938, pruned_loss=0.06782, over 21908.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2912, pruned_loss=0.06707, over 4273628.14 frames. ], batch size: 333, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:58:49,913 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=2071566.0, ans=0.04949747468305833 2023-06-28 13:58:49,936 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2071566.0, ans=0.125 2023-06-28 13:58:54,275 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.350e+02 9.272e+02 1.641e+03 2.423e+03 5.120e+03, threshold=3.282e+03, percent-clipped=25.0 2023-06-28 13:59:16,273 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=2071626.0, ans=0.0 2023-06-28 13:59:43,789 INFO [train.py:996] (0/4) Epoch 12, batch 9850, loss[loss=0.1779, simple_loss=0.2401, pruned_loss=0.05782, over 21618.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.2873, pruned_loss=0.06663, over 4283305.09 frames. ], batch size: 247, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 14:01:25,450 INFO [train.py:996] (0/4) Epoch 12, batch 9900, loss[loss=0.2261, simple_loss=0.3025, pruned_loss=0.07486, over 21833.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.2844, pruned_loss=0.06643, over 4286063.14 frames. 
], batch size: 371, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 14:01:27,979 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2072046.0, ans=0.0 2023-06-28 14:01:41,140 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=2072106.0, ans=0.125 2023-06-28 14:01:58,814 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=2072106.0, ans=0.0 2023-06-28 14:02:17,076 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2072166.0, ans=0.125 2023-06-28 14:02:19,783 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.295e+02 1.063e+03 1.503e+03 2.102e+03 4.753e+03, threshold=3.006e+03, percent-clipped=10.0 2023-06-28 14:03:03,435 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=2072286.0, ans=0.125 2023-06-28 14:03:09,590 INFO [train.py:996] (0/4) Epoch 12, batch 9950, loss[loss=0.2448, simple_loss=0.3257, pruned_loss=0.08192, over 21407.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.2841, pruned_loss=0.06789, over 4278949.80 frames. ], batch size: 131, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 14:03:30,968 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.69 vs. limit=6.0 2023-06-28 14:03:33,609 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=2072406.0, ans=0.125 2023-06-28 14:04:21,752 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=2072526.0, ans=0.0 2023-06-28 14:04:31,809 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2072586.0, ans=0.0 2023-06-28 14:04:46,875 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2072586.0, ans=0.1 2023-06-28 14:04:52,802 INFO [train.py:996] (0/4) Epoch 12, batch 10000, loss[loss=0.2035, simple_loss=0.2812, pruned_loss=0.0629, over 21961.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2806, pruned_loss=0.06688, over 4271695.86 frames. ], batch size: 317, lr: 2.44e-03, grad_scale: 32.0 2023-06-28 14:05:03,434 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2072646.0, ans=0.125 2023-06-28 14:05:50,334 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.893e+02 6.803e+02 1.015e+03 1.604e+03 3.420e+03, threshold=2.029e+03, percent-clipped=1.0 2023-06-28 14:06:18,539 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=2072886.0, ans=0.0 2023-06-28 14:06:36,067 INFO [train.py:996] (0/4) Epoch 12, batch 10050, loss[loss=0.1898, simple_loss=0.2722, pruned_loss=0.05369, over 21066.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.2806, pruned_loss=0.06635, over 4271034.59 frames. ], batch size: 607, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 14:06:45,608 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.80 vs. 
limit=15.0 2023-06-28 14:07:48,075 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2073126.0, ans=0.1 2023-06-28 14:08:11,597 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=2073186.0, ans=0.2 2023-06-28 14:08:21,350 INFO [train.py:996] (0/4) Epoch 12, batch 10100, loss[loss=0.2205, simple_loss=0.2921, pruned_loss=0.07449, over 21821.00 frames. ], tot_loss[loss=0.2047, simple_loss=0.279, pruned_loss=0.06519, over 4265535.32 frames. ], batch size: 124, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 14:08:35,698 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=2073246.0, ans=15.0 2023-06-28 14:09:00,378 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=2073306.0, ans=0.07 2023-06-28 14:09:10,197 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=2073366.0, ans=0.0 2023-06-28 14:09:12,031 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=2073366.0, ans=0.0 2023-06-28 14:09:21,191 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.696e+02 9.806e+02 1.615e+03 2.401e+03 4.786e+03, threshold=3.230e+03, percent-clipped=36.0 2023-06-28 14:09:27,081 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2073426.0, ans=0.09899494936611666 2023-06-28 14:10:10,008 INFO [train.py:996] (0/4) Epoch 12, batch 10150, loss[loss=0.2221, simple_loss=0.3139, pruned_loss=0.06513, over 21734.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2859, pruned_loss=0.06758, over 4266517.06 frames. ], batch size: 332, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 14:10:12,400 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2073546.0, ans=0.125 2023-06-28 14:10:22,057 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=2073546.0, ans=0.05 2023-06-28 14:10:25,357 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2073546.0, ans=0.1 2023-06-28 14:10:48,520 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=2073606.0, ans=0.09899494936611666 2023-06-28 14:10:49,070 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.61 vs. limit=10.0 2023-06-28 14:11:51,908 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=2073846.0, ans=0.2 2023-06-28 14:11:52,850 INFO [train.py:996] (0/4) Epoch 12, batch 10200, loss[loss=0.1975, simple_loss=0.2728, pruned_loss=0.06116, over 21802.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2857, pruned_loss=0.06595, over 4265244.14 frames. ], batch size: 124, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 14:11:53,866 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.80 vs. 
limit=22.5 2023-06-28 14:12:18,398 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=2073906.0, ans=0.125 2023-06-28 14:12:47,767 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.411e+02 8.616e+02 1.269e+03 2.043e+03 3.610e+03, threshold=2.539e+03, percent-clipped=1.0 2023-06-28 14:12:50,079 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=2073966.0, ans=0.2 2023-06-28 14:13:01,143 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=2074026.0, ans=0.025 2023-06-28 14:13:40,922 INFO [train.py:996] (0/4) Epoch 12, batch 10250, loss[loss=0.149, simple_loss=0.2349, pruned_loss=0.03154, over 21627.00 frames. ], tot_loss[loss=0.2014, simple_loss=0.2811, pruned_loss=0.06086, over 4262980.78 frames. ], batch size: 195, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 14:13:51,253 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2074146.0, ans=0.1 2023-06-28 14:14:52,273 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=2074326.0, ans=0.125 2023-06-28 14:15:14,486 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.38 vs. limit=15.0 2023-06-28 14:15:22,272 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=2074386.0, ans=0.2 2023-06-28 14:15:25,097 INFO [train.py:996] (0/4) Epoch 12, batch 10300, loss[loss=0.2192, simple_loss=0.3048, pruned_loss=0.06681, over 21442.00 frames. ], tot_loss[loss=0.2053, simple_loss=0.2858, pruned_loss=0.06244, over 4269260.67 frames. ], batch size: 194, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 14:15:51,521 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2074506.0, ans=0.1 2023-06-28 14:16:18,569 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.94 vs. limit=15.0 2023-06-28 14:16:22,435 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 3.719e+02 6.981e+02 1.162e+03 1.847e+03 5.403e+03, threshold=2.324e+03, percent-clipped=10.0 2023-06-28 14:16:30,246 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2074626.0, ans=0.1 2023-06-28 14:16:33,740 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=2074626.0, ans=0.0 2023-06-28 14:16:46,664 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2074626.0, ans=0.125 2023-06-28 14:17:02,179 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=2074686.0, ans=0.125 2023-06-28 14:17:07,337 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2074686.0, ans=0.0 2023-06-28 14:17:11,846 INFO [train.py:996] (0/4) Epoch 12, batch 10350, loss[loss=0.1831, simple_loss=0.2477, pruned_loss=0.05926, over 21541.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2883, pruned_loss=0.063, over 4264597.97 frames. 
], batch size: 212, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 14:17:34,339 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=2074806.0, ans=0.0 2023-06-28 14:17:57,955 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2074866.0, ans=0.1 2023-06-28 14:18:10,074 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=2074866.0, ans=0.125 2023-06-28 14:18:26,326 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2074926.0, ans=0.0 2023-06-28 14:18:47,495 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.03 vs. limit=22.5 2023-06-28 14:18:56,634 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=2074986.0, ans=0.125 2023-06-28 14:19:00,729 INFO [train.py:996] (0/4) Epoch 12, batch 10400, loss[loss=0.1727, simple_loss=0.2329, pruned_loss=0.05624, over 21488.00 frames. ], tot_loss[loss=0.2034, simple_loss=0.2821, pruned_loss=0.06236, over 4261946.55 frames. ], batch size: 212, lr: 2.44e-03, grad_scale: 32.0 2023-06-28 14:19:25,129 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2075106.0, ans=0.1 2023-06-28 14:19:55,934 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2075166.0, ans=0.1 2023-06-28 14:19:58,464 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.939e+02 1.030e+03 1.665e+03 2.817e+03 5.984e+03, threshold=3.330e+03, percent-clipped=36.0 2023-06-28 14:20:41,711 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2075286.0, ans=0.125 2023-06-28 14:20:46,348 INFO [train.py:996] (0/4) Epoch 12, batch 10450, loss[loss=0.2174, simple_loss=0.304, pruned_loss=0.06541, over 21848.00 frames. ], tot_loss[loss=0.2074, simple_loss=0.2856, pruned_loss=0.06453, over 4259129.03 frames. ], batch size: 316, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 14:21:17,789 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.47 vs. limit=15.0 2023-06-28 14:21:28,689 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2075466.0, ans=0.1 2023-06-28 14:21:29,155 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.94 vs. limit=15.0 2023-06-28 14:22:34,297 INFO [train.py:996] (0/4) Epoch 12, batch 10500, loss[loss=0.2054, simple_loss=0.279, pruned_loss=0.06592, over 21307.00 frames. ], tot_loss[loss=0.2062, simple_loss=0.2857, pruned_loss=0.06335, over 4258936.62 frames. 
], batch size: 471, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 14:23:30,263 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.346e+02 7.811e+02 1.278e+03 1.903e+03 4.033e+03, threshold=2.556e+03, percent-clipped=2.0 2023-06-28 14:23:32,556 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2075826.0, ans=0.125 2023-06-28 14:23:35,847 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2075826.0, ans=0.125 2023-06-28 14:23:48,537 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=2075826.0, ans=0.125 2023-06-28 14:23:50,410 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2075826.0, ans=0.0 2023-06-28 14:23:55,603 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2075886.0, ans=0.2 2023-06-28 14:24:16,705 INFO [train.py:996] (0/4) Epoch 12, batch 10550, loss[loss=0.1853, simple_loss=0.2532, pruned_loss=0.05876, over 21666.00 frames. ], tot_loss[loss=0.203, simple_loss=0.2807, pruned_loss=0.0626, over 4253837.69 frames. ], batch size: 282, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 14:24:20,805 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2075946.0, ans=0.0 2023-06-28 14:24:51,906 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=2076006.0, ans=0.0 2023-06-28 14:25:42,957 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.60 vs. limit=15.0 2023-06-28 14:26:00,762 INFO [train.py:996] (0/4) Epoch 12, batch 10600, loss[loss=0.2263, simple_loss=0.3381, pruned_loss=0.05723, over 20747.00 frames. ], tot_loss[loss=0.1997, simple_loss=0.2763, pruned_loss=0.06157, over 4257457.24 frames. ], batch size: 607, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 14:26:06,716 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2076246.0, ans=0.0 2023-06-28 14:26:14,052 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2076246.0, ans=0.125 2023-06-28 14:26:59,364 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.873e+02 6.316e+02 8.507e+02 1.506e+03 2.988e+03, threshold=1.701e+03, percent-clipped=6.0 2023-06-28 14:27:42,159 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.19 vs. limit=15.0 2023-06-28 14:27:46,051 INFO [train.py:996] (0/4) Epoch 12, batch 10650, loss[loss=0.2112, simple_loss=0.2988, pruned_loss=0.06179, over 21867.00 frames. ], tot_loss[loss=0.2001, simple_loss=0.2791, pruned_loss=0.06056, over 4248910.41 frames. ], batch size: 372, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 14:27:50,352 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=2076546.0, ans=0.07 2023-06-28 14:28:04,058 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=17.01 vs. 
limit=22.5 2023-06-28 14:28:13,720 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2076606.0, ans=0.125 2023-06-28 14:29:18,549 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 14:29:23,655 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2076786.0, ans=0.0 2023-06-28 14:29:29,917 INFO [train.py:996] (0/4) Epoch 12, batch 10700, loss[loss=0.1706, simple_loss=0.243, pruned_loss=0.04907, over 16587.00 frames. ], tot_loss[loss=0.1984, simple_loss=0.2769, pruned_loss=0.05997, over 4244988.33 frames. ], batch size: 60, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 14:30:32,560 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.676e+02 7.956e+02 1.277e+03 1.864e+03 4.109e+03, threshold=2.555e+03, percent-clipped=30.0 2023-06-28 14:30:55,022 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.94 vs. limit=22.5 2023-06-28 14:31:21,315 INFO [train.py:996] (0/4) Epoch 12, batch 10750, loss[loss=0.2798, simple_loss=0.3721, pruned_loss=0.09375, over 21607.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.288, pruned_loss=0.06356, over 4253333.07 frames. ], batch size: 508, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 14:32:06,377 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.56 vs. limit=15.0 2023-06-28 14:32:20,638 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2077266.0, ans=0.0 2023-06-28 14:32:37,083 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.90 vs. limit=15.0 2023-06-28 14:32:43,171 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2077386.0, ans=0.125 2023-06-28 14:32:58,202 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=2077386.0, ans=0.025 2023-06-28 14:33:10,860 INFO [train.py:996] (0/4) Epoch 12, batch 10800, loss[loss=0.2513, simple_loss=0.3227, pruned_loss=0.08993, over 21443.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2921, pruned_loss=0.06438, over 4259708.79 frames. ], batch size: 194, lr: 2.43e-03, grad_scale: 32.0 2023-06-28 14:33:31,468 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2077506.0, ans=0.125 2023-06-28 14:34:08,050 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.630e+02 8.272e+02 1.352e+03 2.286e+03 6.133e+03, threshold=2.703e+03, percent-clipped=22.0 2023-06-28 14:34:14,584 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.34 vs. limit=22.5 2023-06-28 14:34:30,810 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.37 vs. limit=15.0 2023-06-28 14:34:54,580 INFO [train.py:996] (0/4) Epoch 12, batch 10850, loss[loss=0.1688, simple_loss=0.2433, pruned_loss=0.04715, over 21601.00 frames. 
], tot_loss[loss=0.2112, simple_loss=0.2934, pruned_loss=0.06454, over 4265731.00 frames. ], batch size: 263, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 14:35:31,311 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2077806.0, ans=0.125 2023-06-28 14:35:55,470 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2077926.0, ans=0.125 2023-06-28 14:35:57,762 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.70 vs. limit=10.0 2023-06-28 14:36:38,972 INFO [train.py:996] (0/4) Epoch 12, batch 10900, loss[loss=0.1867, simple_loss=0.252, pruned_loss=0.0607, over 21861.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.2868, pruned_loss=0.06333, over 4264561.23 frames. ], batch size: 107, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 14:36:52,785 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2078046.0, ans=0.125 2023-06-28 14:37:12,482 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2078106.0, ans=0.125 2023-06-28 14:37:36,119 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.311e+02 7.551e+02 9.581e+02 1.368e+03 2.722e+03, threshold=1.916e+03, percent-clipped=1.0 2023-06-28 14:37:40,145 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2078226.0, ans=0.125 2023-06-28 14:38:18,070 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=2078286.0, ans=0.125 2023-06-28 14:38:18,670 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.74 vs. limit=22.5 2023-06-28 14:38:20,664 INFO [train.py:996] (0/4) Epoch 12, batch 10950, loss[loss=0.1731, simple_loss=0.2439, pruned_loss=0.05113, over 21298.00 frames. ], tot_loss[loss=0.2034, simple_loss=0.2832, pruned_loss=0.06181, over 4264799.23 frames. ], batch size: 144, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 14:38:54,699 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2078406.0, ans=0.125 2023-06-28 14:39:22,603 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.88 vs. limit=10.0 2023-06-28 14:39:24,073 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.09 vs. limit=12.0 2023-06-28 14:39:25,136 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2078526.0, ans=0.0 2023-06-28 14:39:48,574 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.96 vs. limit=15.0 2023-06-28 14:40:04,312 INFO [train.py:996] (0/4) Epoch 12, batch 11000, loss[loss=0.1919, simple_loss=0.2676, pruned_loss=0.0581, over 21610.00 frames. ], tot_loss[loss=0.2034, simple_loss=0.2822, pruned_loss=0.06232, over 4269555.27 frames. 
], batch size: 212, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 14:40:39,862 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2078706.0, ans=0.0 2023-06-28 14:41:02,132 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.714e+02 8.398e+02 1.287e+03 1.832e+03 5.305e+03, threshold=2.574e+03, percent-clipped=21.0 2023-06-28 14:41:02,784 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2078826.0, ans=0.125 2023-06-28 14:41:32,026 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.81 vs. limit=8.0 2023-06-28 14:41:45,733 INFO [train.py:996] (0/4) Epoch 12, batch 11050, loss[loss=0.1896, simple_loss=0.2565, pruned_loss=0.06138, over 21800.00 frames. ], tot_loss[loss=0.203, simple_loss=0.2797, pruned_loss=0.0632, over 4269118.85 frames. ], batch size: 107, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 14:41:48,127 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=2078946.0, ans=0.07 2023-06-28 14:41:54,736 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2078946.0, ans=0.0 2023-06-28 14:41:54,765 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=2078946.0, ans=0.04949747468305833 2023-06-28 14:42:24,924 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2079066.0, ans=0.1 2023-06-28 14:42:30,166 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2079066.0, ans=0.125 2023-06-28 14:42:38,741 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.84 vs. limit=6.0 2023-06-28 14:43:04,334 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=2079126.0, ans=0.125 2023-06-28 14:43:20,146 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.01 vs. limit=15.0 2023-06-28 14:43:23,999 INFO [train.py:996] (0/4) Epoch 12, batch 11100, loss[loss=0.1892, simple_loss=0.2574, pruned_loss=0.06047, over 21850.00 frames. ], tot_loss[loss=0.2023, simple_loss=0.2779, pruned_loss=0.06336, over 4270240.80 frames. ], batch size: 118, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 14:44:16,311 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2079366.0, ans=0.0 2023-06-28 14:44:22,371 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.287e+02 7.124e+02 1.046e+03 1.474e+03 3.228e+03, threshold=2.092e+03, percent-clipped=3.0 2023-06-28 14:45:06,639 INFO [train.py:996] (0/4) Epoch 12, batch 11150, loss[loss=0.2019, simple_loss=0.2739, pruned_loss=0.06492, over 21679.00 frames. ], tot_loss[loss=0.2003, simple_loss=0.2748, pruned_loss=0.06292, over 4270466.23 frames. ], batch size: 112, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 14:45:22,951 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.98 vs. 
limit=15.0 2023-06-28 14:45:41,977 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2079606.0, ans=0.125 2023-06-28 14:46:43,198 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2079786.0, ans=0.0 2023-06-28 14:46:49,370 INFO [train.py:996] (0/4) Epoch 12, batch 11200, loss[loss=0.1998, simple_loss=0.2628, pruned_loss=0.06845, over 21596.00 frames. ], tot_loss[loss=0.1998, simple_loss=0.2742, pruned_loss=0.06273, over 4272790.17 frames. ], batch size: 332, lr: 2.43e-03, grad_scale: 32.0 2023-06-28 14:47:21,391 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=2079906.0, ans=0.025 2023-06-28 14:47:48,167 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.550e+02 9.831e+02 1.329e+03 1.720e+03 5.358e+03, threshold=2.658e+03, percent-clipped=16.0 2023-06-28 14:47:51,968 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2080026.0, ans=0.2 2023-06-28 14:47:53,723 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2080026.0, ans=0.1 2023-06-28 14:47:55,492 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2080026.0, ans=0.1 2023-06-28 14:48:30,198 INFO [train.py:996] (0/4) Epoch 12, batch 11250, loss[loss=0.1882, simple_loss=0.2791, pruned_loss=0.04863, over 21694.00 frames. ], tot_loss[loss=0.1989, simple_loss=0.2727, pruned_loss=0.06256, over 4260814.01 frames. ], batch size: 298, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 14:48:58,780 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_na.min_abs, batch_count=2080206.0, ans=0.02 2023-06-28 14:49:00,284 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=2080206.0, ans=0.5 2023-06-28 14:49:07,357 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2080206.0, ans=0.125 2023-06-28 14:49:12,102 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=2080266.0, ans=0.2 2023-06-28 14:49:33,574 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2080326.0, ans=0.0 2023-06-28 14:49:35,056 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2080326.0, ans=0.125 2023-06-28 14:50:12,586 INFO [train.py:996] (0/4) Epoch 12, batch 11300, loss[loss=0.1932, simple_loss=0.2766, pruned_loss=0.05487, over 21872.00 frames. ], tot_loss[loss=0.2012, simple_loss=0.2756, pruned_loss=0.06335, over 4272795.25 frames. 
], batch size: 316, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 14:51:01,787 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=2080566.0, ans=0.07 2023-06-28 14:51:14,614 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.978e+02 7.526e+02 1.048e+03 1.657e+03 3.488e+03, threshold=2.097e+03, percent-clipped=3.0 2023-06-28 14:51:28,106 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2080626.0, ans=0.125 2023-06-28 14:51:28,693 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.27 vs. limit=22.5 2023-06-28 14:51:48,398 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 14:51:55,956 INFO [train.py:996] (0/4) Epoch 12, batch 11350, loss[loss=0.2385, simple_loss=0.3215, pruned_loss=0.07774, over 21742.00 frames. ], tot_loss[loss=0.2017, simple_loss=0.2776, pruned_loss=0.06289, over 4277768.72 frames. ], batch size: 351, lr: 2.43e-03, grad_scale: 8.0 2023-06-28 14:52:23,289 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2080806.0, ans=0.0 2023-06-28 14:52:43,141 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.66 vs. limit=22.5 2023-06-28 14:53:14,591 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2080926.0, ans=0.125 2023-06-28 14:53:51,204 INFO [train.py:996] (0/4) Epoch 12, batch 11400, loss[loss=0.2362, simple_loss=0.3296, pruned_loss=0.0714, over 21242.00 frames. ], tot_loss[loss=0.2056, simple_loss=0.2814, pruned_loss=0.06484, over 4275907.71 frames. ], batch size: 549, lr: 2.43e-03, grad_scale: 8.0 2023-06-28 14:54:06,592 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=2081106.0, ans=0.0 2023-06-28 14:54:06,717 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=2081106.0, ans=0.2 2023-06-28 14:54:25,078 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=2081166.0, ans=0.125 2023-06-28 14:54:51,842 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.874e+02 7.904e+02 1.165e+03 1.837e+03 4.416e+03, threshold=2.330e+03, percent-clipped=18.0 2023-06-28 14:55:34,027 INFO [train.py:996] (0/4) Epoch 12, batch 11450, loss[loss=0.1889, simple_loss=0.2593, pruned_loss=0.05924, over 21207.00 frames. ], tot_loss[loss=0.206, simple_loss=0.2838, pruned_loss=0.06409, over 4282093.34 frames. 
], batch size: 608, lr: 2.43e-03, grad_scale: 8.0 2023-06-28 14:55:34,966 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2081346.0, ans=0.0 2023-06-28 14:55:56,707 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2081406.0, ans=0.1 2023-06-28 14:56:16,801 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2081466.0, ans=0.125 2023-06-28 14:56:16,813 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=2081466.0, ans=0.2 2023-06-28 14:57:16,142 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.58 vs. limit=6.0 2023-06-28 14:57:18,121 INFO [train.py:996] (0/4) Epoch 12, batch 11500, loss[loss=0.2553, simple_loss=0.3244, pruned_loss=0.09308, over 21788.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.288, pruned_loss=0.06575, over 4280269.01 frames. ], batch size: 441, lr: 2.43e-03, grad_scale: 8.0 2023-06-28 14:57:27,121 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=2081646.0, ans=0.07 2023-06-28 14:57:52,729 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 14:58:14,834 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.94 vs. limit=8.0 2023-06-28 14:58:20,249 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.484e+02 9.519e+02 1.302e+03 1.957e+03 4.452e+03, threshold=2.605e+03, percent-clipped=16.0 2023-06-28 14:58:51,677 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2081886.0, ans=0.125 2023-06-28 14:59:03,084 INFO [train.py:996] (0/4) Epoch 12, batch 11550, loss[loss=0.2226, simple_loss=0.3252, pruned_loss=0.06003, over 21768.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2945, pruned_loss=0.06637, over 4279788.50 frames. ], batch size: 332, lr: 2.43e-03, grad_scale: 8.0 2023-06-28 14:59:09,018 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2081946.0, ans=0.125 2023-06-28 14:59:48,130 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2082066.0, ans=0.125 2023-06-28 15:00:17,182 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2082126.0, ans=0.1 2023-06-28 15:00:46,577 INFO [train.py:996] (0/4) Epoch 12, batch 11600, loss[loss=0.2293, simple_loss=0.3217, pruned_loss=0.06842, over 21281.00 frames. ], tot_loss[loss=0.223, simple_loss=0.3092, pruned_loss=0.06841, over 4274418.55 frames. 
], batch size: 143, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:00:58,563 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=2082246.0, ans=0.2 2023-06-28 15:01:58,186 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.568e+02 8.664e+02 1.450e+03 2.268e+03 5.007e+03, threshold=2.901e+03, percent-clipped=18.0 2023-06-28 15:02:00,541 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=2082426.0, ans=0.2 2023-06-28 15:02:05,616 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2082426.0, ans=0.125 2023-06-28 15:02:26,760 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.37 vs. limit=12.0 2023-06-28 15:02:35,268 INFO [train.py:996] (0/4) Epoch 12, batch 11650, loss[loss=0.2263, simple_loss=0.3083, pruned_loss=0.07212, over 21235.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3171, pruned_loss=0.06912, over 4273544.21 frames. ], batch size: 159, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:03:01,229 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.25 vs. limit=6.0 2023-06-28 15:03:56,480 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=2082786.0, ans=0.0 2023-06-28 15:04:16,901 INFO [train.py:996] (0/4) Epoch 12, batch 11700, loss[loss=0.2033, simple_loss=0.2727, pruned_loss=0.06694, over 21844.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.3102, pruned_loss=0.06868, over 4273006.14 frames. ], batch size: 102, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:04:42,119 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.25 vs. limit=6.0 2023-06-28 15:05:16,594 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=2082966.0, ans=0.125 2023-06-28 15:05:18,021 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2082966.0, ans=0.1 2023-06-28 15:05:22,741 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.474e+02 9.550e+02 1.552e+03 2.202e+03 4.902e+03, threshold=3.105e+03, percent-clipped=9.0 2023-06-28 15:05:31,698 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2083026.0, ans=0.125 2023-06-28 15:05:38,610 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=2083086.0, ans=0.0 2023-06-28 15:05:44,947 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=2083086.0, ans=10.0 2023-06-28 15:06:04,482 INFO [train.py:996] (0/4) Epoch 12, batch 11750, loss[loss=0.2236, simple_loss=0.2935, pruned_loss=0.07686, over 21506.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.3002, pruned_loss=0.06806, over 4271510.85 frames. 
], batch size: 132, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:06:18,345 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2083146.0, ans=0.1 2023-06-28 15:07:37,317 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2083386.0, ans=0.1 2023-06-28 15:07:47,881 INFO [train.py:996] (0/4) Epoch 12, batch 11800, loss[loss=0.2357, simple_loss=0.3464, pruned_loss=0.06253, over 19721.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.3013, pruned_loss=0.06962, over 4274954.05 frames. ], batch size: 703, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:08:37,800 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2083566.0, ans=0.1 2023-06-28 15:08:48,884 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.444e+02 9.330e+02 1.436e+03 2.225e+03 5.022e+03, threshold=2.872e+03, percent-clipped=11.0 2023-06-28 15:09:00,626 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.84 vs. limit=15.0 2023-06-28 15:09:26,679 INFO [train.py:996] (0/4) Epoch 12, batch 11850, loss[loss=0.2391, simple_loss=0.3293, pruned_loss=0.07444, over 21666.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.3013, pruned_loss=0.06914, over 4276498.09 frames. ], batch size: 441, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:09:34,359 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2083746.0, ans=0.1 2023-06-28 15:09:49,797 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.74 vs. limit=15.0 2023-06-28 15:09:51,055 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=2083806.0, ans=0.2 2023-06-28 15:10:00,410 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=2083806.0, ans=0.2 2023-06-28 15:10:33,056 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=2083926.0, ans=0.2 2023-06-28 15:10:37,716 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2083926.0, ans=0.0 2023-06-28 15:11:10,427 INFO [train.py:996] (0/4) Epoch 12, batch 11900, loss[loss=0.1976, simple_loss=0.2831, pruned_loss=0.05605, over 21745.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.3002, pruned_loss=0.06675, over 4279797.14 frames. 
], batch size: 316, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:11:11,019 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=2084046.0, ans=0.0 2023-06-28 15:11:53,980 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2084166.0, ans=0.125 2023-06-28 15:12:04,254 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2084166.0, ans=0.125 2023-06-28 15:12:13,559 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.185e+02 7.198e+02 9.065e+02 1.390e+03 3.282e+03, threshold=1.813e+03, percent-clipped=3.0 2023-06-28 15:12:44,655 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2084286.0, ans=0.0 2023-06-28 15:12:46,694 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2084286.0, ans=0.125 2023-06-28 15:12:54,377 INFO [train.py:996] (0/4) Epoch 12, batch 11950, loss[loss=0.1835, simple_loss=0.3114, pruned_loss=0.02785, over 20780.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.3, pruned_loss=0.06382, over 4269764.08 frames. ], batch size: 607, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:13:04,569 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=2084346.0, ans=0.0 2023-06-28 15:13:27,490 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2084406.0, ans=0.0 2023-06-28 15:14:02,177 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=2084526.0, ans=0.07 2023-06-28 15:14:13,267 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=2084526.0, ans=0.2 2023-06-28 15:14:19,603 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=2084586.0, ans=0.125 2023-06-28 15:14:35,767 INFO [train.py:996] (0/4) Epoch 12, batch 12000, loss[loss=0.1838, simple_loss=0.2507, pruned_loss=0.0584, over 21636.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2926, pruned_loss=0.06147, over 4270385.69 frames. ], batch size: 298, lr: 2.43e-03, grad_scale: 32.0 2023-06-28 15:14:35,769 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-28 15:14:48,760 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.1.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([5.0317, 2.3641, 4.3408, 2.3193], device='cuda:0') 2023-06-28 15:14:56,361 INFO [train.py:1028] (0/4) Epoch 12, validation: loss=0.2655, simple_loss=0.3539, pruned_loss=0.08861, over 1796401.00 frames. 
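Note on the attention-entropy diagnostic printed just above: during the validation pass the run also logs the entropy of one module's self-attention weights (the zipformer.py attn_weights_entropy tensor), a measure of how peaked or diffuse each head's attention distribution is. The following is a minimal sketch of such a diagnostic, assuming a (num_heads, query_len, key_len) weight tensor; the function name and layout are illustrative assumptions, not code taken from zipformer.py.

import torch

def attn_weights_entropy(attn: torch.Tensor) -> torch.Tensor:
    # attn: (num_heads, query_len, key_len), each row already softmax-normalised
    ent = -(attn.clamp(min=1e-20).log() * attn).sum(dim=-1)  # entropy per (head, query)
    return ent.mean(dim=-1)                                  # average over queries -> one value per head

weights = torch.softmax(torch.randn(4, 50, 50), dim=-1)
print(attn_weights_entropy(weights))  # near-uniform rows approach log(50), about 3.9

If the four printed values are per-head entropies, the lower ones (around 2.3) correspond to heads whose attention is more sharply focused than the others.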
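Note on the optim.py lines throughout this stretch: each reports grad-norm quartiles, a clipping threshold, and the percentage of recent steps that were clipped. Below is a minimal sketch of keeping such statistics and clipping against a threshold, assuming a window of recent norms and a threshold set as a multiple of the median; the class name and that particular threshold rule are illustrative assumptions, not the optimizer's actual recipe.

from collections import deque

import torch

class GradNormStats:
    def __init__(self, window: int = 128, scale: float = 2.0):
        self.norms = deque(maxlen=window)  # recent total gradient norms
        self.scale = scale                 # assumed rule: threshold = scale * median

    def step(self, parameters) -> dict:
        params = [p for p in parameters if p.grad is not None]
        norm = torch.norm(torch.stack([p.grad.norm() for p in params])).item()
        self.norms.append(norm)
        s = sorted(self.norms)

        def q(f: float) -> float:
            return s[int(f * (len(s) - 1))]

        threshold = self.scale * q(0.5)
        if norm > threshold:               # clip in place when over the threshold
            for p in params:
                p.grad.mul_(threshold / (norm + 1e-6))
        return {"quartiles": [q(0.0), q(0.25), q(0.5), q(0.75), q(1.0)],
                "threshold": threshold, "clipped": norm > threshold}

Called once per optimizer step, this produces the same shape of report as the logged line (five quartile values, a threshold, and whether the step was clipped), though of course not its exact numbers.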
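Note on the scaling.py "Whitening" lines: each compares a per-module statistic against a limit (for example metric=7.98 vs. limit=15.0), flagging activations whose channels are far from decorrelated. A rough proxy for such a statistic, assuming it summarises the eigenvalue spread of the activation covariance, is sketched below; a perfectly "white" feature matrix gives a value close to 1. This is an illustrative stand-in, not the exact formula in scaling.py.

import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
    # x: (num_frames, num_channels); channels split into groups as in the log lines
    n, c = x.shape
    x = x.reshape(n, num_groups, c // num_groups).transpose(0, 1)   # (groups, frames, chans)
    x = x - x.mean(dim=1, keepdim=True)
    cov = x.transpose(1, 2) @ x / n                                  # per-group covariance
    eigs = torch.linalg.eigvalsh(cov)
    ratio = (eigs ** 2).mean(dim=1) / eigs.mean(dim=1).clamp(min=1e-20) ** 2
    return ratio.mean().item()                                       # 1.0 == perfectly white

print(whitening_metric(torch.randn(1000, 256)))   # roughly 1.3 for uncorrelated features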
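Note on the "ScheduledFloat" entries: they record hyperparameters such as dropout probabilities, bypass/skip rates and balancer probabilities whose current value (ans=...) is a function of the global batch count. A minimal sketch of a piecewise-linear schedule keyed on batch count is shown below; the class name and the example breakpoints are assumptions for illustration, not the schedules actually configured in this recipe.

import bisect

class PiecewiseScheduledFloat:
    def __init__(self, *points):
        # points: (batch_count, value) pairs, sorted by batch_count
        self.xs = [p[0] for p in points]
        self.ys = [p[1] for p in points]

    def value(self, batch_count: float) -> float:
        i = bisect.bisect_right(self.xs, batch_count)
        if i == 0:
            return self.ys[0]
        if i == len(self.xs):
            return self.ys[-1]
        x0, x1 = self.xs[i - 1], self.xs[i]
        y0, y1 = self.ys[i - 1], self.ys[i]
        return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

# Hypothetical schedule: a dropout that decays from 0.3 to 0.1 over the first 20k batches.
dropout_p = PiecewiseScheduledFloat((0, 0.3), (20000, 0.1))
print(dropout_p.value(2084646))   # far past the last breakpoint, so it stays at 0.1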
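Note on the per-batch grad_scale value (switching between 8.0, 16.0 and 32.0 in this section): this is the fp16 loss-scaling factor, which is reduced after steps whose gradients overflow and grown back after a run of finite steps. Below is a minimal mixed-precision step with torch.cuda.amp showing where that scale lives; the model, optimizer and data are placeholders rather than this recipe's, and it assumes a CUDA device is available.

import torch

model = torch.nn.Linear(80, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()

features = torch.randn(16, 80, device="cuda")
targets = torch.randint(0, 10, (16,), device="cuda")

optimizer.zero_grad()
with torch.cuda.amp.autocast():
    loss = torch.nn.functional.cross_entropy(model(features), targets)

scaler.scale(loss).backward()   # backprop on the scaled loss
scaler.step(optimizer)          # unscales gradients; skips the step on inf/nan
scaler.update()                 # grows or shrinks the scale for the next step
print("current grad_scale:", scaler.get_scale())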
2023-06-28 15:14:56,362 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23714MB 2023-06-28 15:14:58,724 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=2084646.0, ans=0.0 2023-06-28 15:15:57,095 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.607e+02 7.542e+02 1.127e+03 1.617e+03 2.900e+03, threshold=2.254e+03, percent-clipped=14.0 2023-06-28 15:16:21,926 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2084886.0, ans=0.125 2023-06-28 15:16:38,993 INFO [train.py:996] (0/4) Epoch 12, batch 12050, loss[loss=0.2265, simple_loss=0.3468, pruned_loss=0.05305, over 19830.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2889, pruned_loss=0.06276, over 4268038.79 frames. ], batch size: 702, lr: 2.43e-03, grad_scale: 32.0 2023-06-28 15:16:54,661 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=2085006.0, ans=0.2 2023-06-28 15:17:33,272 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2085066.0, ans=0.125 2023-06-28 15:17:42,501 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.24 vs. limit=12.0 2023-06-28 15:17:47,260 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2085126.0, ans=0.125 2023-06-28 15:18:00,451 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2085186.0, ans=0.125 2023-06-28 15:18:01,104 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.72 vs. limit=22.5 2023-06-28 15:18:22,709 INFO [train.py:996] (0/4) Epoch 12, batch 12100, loss[loss=0.2153, simple_loss=0.2942, pruned_loss=0.06817, over 21630.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2935, pruned_loss=0.06619, over 4272235.31 frames. ], batch size: 263, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:19:01,787 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.55 vs. limit=15.0 2023-06-28 15:19:01,924 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.47 vs. limit=15.0 2023-06-28 15:19:26,576 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.29 vs. 
limit=15.0 2023-06-28 15:19:28,428 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.830e+02 8.334e+02 1.059e+03 1.614e+03 4.516e+03, threshold=2.118e+03, percent-clipped=9.0 2023-06-28 15:19:39,473 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2085426.0, ans=0.125 2023-06-28 15:20:03,064 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2085486.0, ans=0.125 2023-06-28 15:20:04,585 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=2085486.0, ans=0.2 2023-06-28 15:20:08,869 INFO [train.py:996] (0/4) Epoch 12, batch 12150, loss[loss=0.217, simple_loss=0.3113, pruned_loss=0.06134, over 20714.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2981, pruned_loss=0.06646, over 4267851.00 frames. ], batch size: 607, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:20:12,855 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=2085546.0, ans=0.125 2023-06-28 15:20:17,832 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=2085546.0, ans=0.125 2023-06-28 15:20:29,487 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.75 vs. limit=22.5 2023-06-28 15:21:42,648 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=2085786.0, ans=0.05 2023-06-28 15:21:46,300 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=2085786.0, ans=0.125 2023-06-28 15:21:48,327 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=2085786.0, ans=6.0 2023-06-28 15:21:50,476 INFO [train.py:996] (0/4) Epoch 12, batch 12200, loss[loss=0.1795, simple_loss=0.2425, pruned_loss=0.05829, over 21174.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2955, pruned_loss=0.06576, over 4272999.61 frames. ], batch size: 548, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:21:54,796 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=2085846.0, ans=0.125 2023-06-28 15:21:56,113 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2085846.0, ans=0.0 2023-06-28 15:22:31,293 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=2085906.0, ans=0.125 2023-06-28 15:22:34,599 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2085966.0, ans=0.125 2023-06-28 15:23:03,727 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.152e+02 7.335e+02 1.257e+03 1.849e+03 4.350e+03, threshold=2.514e+03, percent-clipped=17.0 2023-06-28 15:23:11,655 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.33 vs. 
limit=15.0 2023-06-28 15:23:17,496 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2086086.0, ans=0.125 2023-06-28 15:23:33,609 INFO [train.py:996] (0/4) Epoch 12, batch 12250, loss[loss=0.1553, simple_loss=0.2433, pruned_loss=0.03371, over 21725.00 frames. ], tot_loss[loss=0.2064, simple_loss=0.287, pruned_loss=0.06287, over 4258579.96 frames. ], batch size: 247, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:24:22,776 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=2086266.0, ans=0.125 2023-06-28 15:24:55,585 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=2086326.0, ans=0.0 2023-06-28 15:25:04,662 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=5.83 vs. limit=22.5 2023-06-28 15:25:16,906 INFO [train.py:996] (0/4) Epoch 12, batch 12300, loss[loss=0.2349, simple_loss=0.3284, pruned_loss=0.07068, over 21534.00 frames. ], tot_loss[loss=0.1995, simple_loss=0.2812, pruned_loss=0.05892, over 4253695.67 frames. ], batch size: 471, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:25:17,578 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=2086446.0, ans=0.2 2023-06-28 15:25:56,613 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=2086506.0, ans=0.0 2023-06-28 15:26:29,028 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.595e+02 6.953e+02 1.064e+03 1.818e+03 4.648e+03, threshold=2.128e+03, percent-clipped=12.0 2023-06-28 15:26:32,654 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2086626.0, ans=0.125 2023-06-28 15:26:58,277 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=2086746.0, ans=0.2 2023-06-28 15:26:59,142 INFO [train.py:996] (0/4) Epoch 12, batch 12350, loss[loss=0.2277, simple_loss=0.3086, pruned_loss=0.07344, over 21476.00 frames. ], tot_loss[loss=0.2018, simple_loss=0.2849, pruned_loss=0.05938, over 4255267.68 frames. ], batch size: 548, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:27:25,979 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.60 vs. limit=15.0 2023-06-28 15:27:27,151 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=2086806.0, ans=0.2 2023-06-28 15:27:30,574 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=2086806.0, ans=0.07 2023-06-28 15:27:40,307 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2086866.0, ans=0.125 2023-06-28 15:27:52,044 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.34 vs. 
limit=6.0 2023-06-28 15:28:14,246 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=2086926.0, ans=0.125 2023-06-28 15:28:26,030 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2086986.0, ans=0.0 2023-06-28 15:28:31,655 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.44 vs. limit=22.5 2023-06-28 15:28:40,053 INFO [train.py:996] (0/4) Epoch 12, batch 12400, loss[loss=0.2116, simple_loss=0.2839, pruned_loss=0.06962, over 21854.00 frames. ], tot_loss[loss=0.2058, simple_loss=0.2881, pruned_loss=0.06174, over 4264206.39 frames. ], batch size: 298, lr: 2.43e-03, grad_scale: 32.0 2023-06-28 15:28:45,817 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=2087046.0, ans=0.5 2023-06-28 15:29:14,241 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.97 vs. limit=6.0 2023-06-28 15:29:45,257 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2087166.0, ans=0.0 2023-06-28 15:29:54,407 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.210e+02 8.096e+02 1.137e+03 1.573e+03 3.341e+03, threshold=2.274e+03, percent-clipped=11.0 2023-06-28 15:30:32,678 INFO [train.py:996] (0/4) Epoch 12, batch 12450, loss[loss=0.2492, simple_loss=0.3254, pruned_loss=0.08648, over 21579.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2921, pruned_loss=0.0647, over 4269494.66 frames. ], batch size: 263, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:32:16,002 INFO [train.py:996] (0/4) Epoch 12, batch 12500, loss[loss=0.2539, simple_loss=0.3603, pruned_loss=0.07373, over 21890.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.3018, pruned_loss=0.0677, over 4269897.92 frames. ], batch size: 372, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:32:56,999 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.00 vs. limit=15.0 2023-06-28 15:33:14,443 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.30 vs. limit=15.0 2023-06-28 15:33:22,115 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.609e+02 8.449e+02 1.202e+03 1.905e+03 3.240e+03, threshold=2.404e+03, percent-clipped=12.0 2023-06-28 15:33:44,343 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=2087886.0, ans=0.2 2023-06-28 15:33:58,304 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=2087886.0, ans=0.125 2023-06-28 15:34:05,900 INFO [train.py:996] (0/4) Epoch 12, batch 12550, loss[loss=0.27, simple_loss=0.3326, pruned_loss=0.1037, over 21433.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.3076, pruned_loss=0.07037, over 4268512.07 frames. 
], batch size: 471, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:34:24,496 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-348000.pt 2023-06-28 15:34:38,687 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2088006.0, ans=0.2 2023-06-28 15:35:06,385 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=2088126.0, ans=0.07 2023-06-28 15:35:14,795 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=2088126.0, ans=0.125 2023-06-28 15:35:24,553 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=2088126.0, ans=0.2 2023-06-28 15:35:43,056 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2088186.0, ans=0.125 2023-06-28 15:35:50,653 INFO [train.py:996] (0/4) Epoch 12, batch 12600, loss[loss=0.2122, simple_loss=0.2914, pruned_loss=0.06646, over 21393.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.3074, pruned_loss=0.06898, over 4264020.95 frames. ], batch size: 211, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:36:01,095 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2088246.0, ans=0.1 2023-06-28 15:36:34,989 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.97 vs. limit=15.0 2023-06-28 15:36:59,615 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.816e+02 8.006e+02 1.115e+03 1.640e+03 2.498e+03, threshold=2.229e+03, percent-clipped=4.0 2023-06-28 15:36:59,969 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=2088426.0, ans=0.125 2023-06-28 15:37:27,728 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.58 vs. limit=22.5 2023-06-28 15:37:31,411 INFO [train.py:996] (0/4) Epoch 12, batch 12650, loss[loss=0.2164, simple_loss=0.2861, pruned_loss=0.07334, over 21473.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2987, pruned_loss=0.06545, over 4260083.34 frames. ], batch size: 194, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:38:01,602 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2088606.0, ans=0.1 2023-06-28 15:38:06,759 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 15:38:08,384 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=2088666.0, ans=0.0 2023-06-28 15:38:11,633 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=2088666.0, ans=0.0 2023-06-28 15:38:57,985 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 15:39:18,833 INFO [train.py:996] (0/4) Epoch 12, batch 12700, loss[loss=0.2418, simple_loss=0.3245, pruned_loss=0.07957, over 21896.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2965, pruned_loss=0.06708, over 4265278.96 frames. 
], batch size: 124, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:40:16,883 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2089026.0, ans=0.0 2023-06-28 15:40:22,985 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.191e+02 7.639e+02 1.038e+03 1.743e+03 3.264e+03, threshold=2.075e+03, percent-clipped=12.0 2023-06-28 15:40:55,011 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=2089086.0, ans=0.0 2023-06-28 15:41:01,161 INFO [train.py:996] (0/4) Epoch 12, batch 12750, loss[loss=0.1947, simple_loss=0.2839, pruned_loss=0.05274, over 21763.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2974, pruned_loss=0.06666, over 4267446.31 frames. ], batch size: 298, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:41:29,779 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2089206.0, ans=0.0 2023-06-28 15:41:40,098 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.51 vs. limit=12.0 2023-06-28 15:41:54,157 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2089266.0, ans=0.0 2023-06-28 15:42:14,314 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=2089326.0, ans=0.0 2023-06-28 15:42:29,359 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.70 vs. limit=22.5 2023-06-28 15:42:43,118 INFO [train.py:996] (0/4) Epoch 12, batch 12800, loss[loss=0.2411, simple_loss=0.3094, pruned_loss=0.08638, over 21799.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.296, pruned_loss=0.06694, over 4273316.72 frames. ], batch size: 441, lr: 2.43e-03, grad_scale: 32.0 2023-06-28 15:43:15,260 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=2089506.0, ans=0.2 2023-06-28 15:43:40,701 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2089566.0, ans=0.125 2023-06-28 15:43:52,887 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.21 vs. limit=10.0 2023-06-28 15:43:55,078 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.524e+02 8.190e+02 1.176e+03 1.690e+03 3.535e+03, threshold=2.352e+03, percent-clipped=8.0 2023-06-28 15:43:57,887 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.33 vs. limit=6.0 2023-06-28 15:44:14,575 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=2089686.0, ans=0.0 2023-06-28 15:44:27,202 INFO [train.py:996] (0/4) Epoch 12, batch 12850, loss[loss=0.254, simple_loss=0.3445, pruned_loss=0.08177, over 21502.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2974, pruned_loss=0.06863, over 4275061.25 frames. 
], batch size: 508, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:44:49,743 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2089806.0, ans=0.1 2023-06-28 15:46:06,977 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.46 vs. limit=15.0 2023-06-28 15:46:14,572 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2090046.0, ans=0.125 2023-06-28 15:46:15,557 INFO [train.py:996] (0/4) Epoch 12, batch 12900, loss[loss=0.2302, simple_loss=0.3217, pruned_loss=0.06937, over 21573.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.2942, pruned_loss=0.06519, over 4276412.18 frames. ], batch size: 441, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:47:20,056 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2090226.0, ans=0.0 2023-06-28 15:47:25,954 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.842e+02 7.681e+02 1.209e+03 1.743e+03 3.932e+03, threshold=2.418e+03, percent-clipped=11.0 2023-06-28 15:47:41,715 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=2090286.0, ans=0.2 2023-06-28 15:48:02,362 INFO [train.py:996] (0/4) Epoch 12, batch 12950, loss[loss=0.1739, simple_loss=0.2537, pruned_loss=0.04706, over 20186.00 frames. ], tot_loss[loss=0.2085, simple_loss=0.2913, pruned_loss=0.06288, over 4272354.23 frames. ], batch size: 703, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:48:21,406 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2090406.0, ans=0.0 2023-06-28 15:48:52,875 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.38 vs. limit=10.0 2023-06-28 15:49:44,547 INFO [train.py:996] (0/4) Epoch 12, batch 13000, loss[loss=0.2151, simple_loss=0.2926, pruned_loss=0.06883, over 20628.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2923, pruned_loss=0.06392, over 4270823.72 frames. ], batch size: 607, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:50:23,094 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=2090706.0, ans=0.0 2023-06-28 15:50:26,357 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=2090766.0, ans=0.0 2023-06-28 15:50:49,324 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2090826.0, ans=0.1 2023-06-28 15:50:50,318 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.116e+02 7.680e+02 1.047e+03 1.374e+03 2.853e+03, threshold=2.094e+03, percent-clipped=2.0 2023-06-28 15:50:56,526 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.55 vs. limit=15.0 2023-06-28 15:51:25,517 INFO [train.py:996] (0/4) Epoch 12, batch 13050, loss[loss=0.2208, simple_loss=0.2909, pruned_loss=0.07532, over 21803.00 frames. ], tot_loss[loss=0.2058, simple_loss=0.2879, pruned_loss=0.06183, over 4265969.10 frames. 
], batch size: 441, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:51:36,036 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=2090946.0, ans=0.2 2023-06-28 15:52:26,642 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=2091126.0, ans=0.125 2023-06-28 15:53:07,013 INFO [train.py:996] (0/4) Epoch 12, batch 13100, loss[loss=0.2023, simple_loss=0.2889, pruned_loss=0.05788, over 21657.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.2889, pruned_loss=0.06199, over 4274641.35 frames. ], batch size: 263, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:53:09,449 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2091246.0, ans=0.125 2023-06-28 15:53:10,664 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=2091246.0, ans=0.1 2023-06-28 15:53:33,376 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.56 vs. limit=15.0 2023-06-28 15:54:14,293 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.157e+02 6.908e+02 8.185e+02 1.188e+03 2.631e+03, threshold=1.637e+03, percent-clipped=2.0 2023-06-28 15:54:41,941 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2091486.0, ans=0.125 2023-06-28 15:54:50,927 INFO [train.py:996] (0/4) Epoch 12, batch 13150, loss[loss=0.2314, simple_loss=0.3101, pruned_loss=0.07636, over 21786.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2918, pruned_loss=0.06485, over 4273114.10 frames. ], batch size: 124, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:54:51,544 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2091546.0, ans=0.125 2023-06-28 15:55:48,684 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.25 vs. limit=22.5 2023-06-28 15:56:14,157 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2091786.0, ans=0.0 2023-06-28 15:56:37,608 INFO [train.py:996] (0/4) Epoch 12, batch 13200, loss[loss=0.2156, simple_loss=0.289, pruned_loss=0.0711, over 21276.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2904, pruned_loss=0.06454, over 4274690.47 frames. 
], batch size: 176, lr: 2.43e-03, grad_scale: 32.0 2023-06-28 15:56:55,090 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2091906.0, ans=0.125 2023-06-28 15:56:56,560 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=2091906.0, ans=0.125 2023-06-28 15:57:24,032 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=2091966.0, ans=0.125 2023-06-28 15:57:27,569 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=2091966.0, ans=0.2 2023-06-28 15:57:38,467 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2092026.0, ans=0.09899494936611666 2023-06-28 15:57:46,671 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.053e+02 7.241e+02 1.163e+03 1.743e+03 3.163e+03, threshold=2.326e+03, percent-clipped=27.0 2023-06-28 15:58:21,568 INFO [train.py:996] (0/4) Epoch 12, batch 13250, loss[loss=0.1824, simple_loss=0.2606, pruned_loss=0.05209, over 21769.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2921, pruned_loss=0.06737, over 4281771.43 frames. ], batch size: 282, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:58:37,697 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=2092206.0, ans=0.0 2023-06-28 15:58:55,719 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=2092206.0, ans=0.125 2023-06-28 15:59:08,628 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2092266.0, ans=0.0 2023-06-28 15:59:12,312 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=2092266.0, ans=0.025 2023-06-28 15:59:57,888 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.79 vs. limit=15.0 2023-06-28 16:00:05,152 INFO [train.py:996] (0/4) Epoch 12, batch 13300, loss[loss=0.2115, simple_loss=0.2969, pruned_loss=0.06302, over 21493.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2955, pruned_loss=0.06693, over 4286691.82 frames. ], batch size: 211, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 16:00:08,939 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2092446.0, ans=0.1 2023-06-28 16:00:59,366 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=2092566.0, ans=0.05 2023-06-28 16:01:00,019 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.09 vs. 
limit=15.0 2023-06-28 16:01:01,058 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2092566.0, ans=0.1 2023-06-28 16:01:23,229 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.410e+02 8.803e+02 1.227e+03 2.112e+03 5.928e+03, threshold=2.454e+03, percent-clipped=20.0 2023-06-28 16:01:30,922 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2092686.0, ans=0.125 2023-06-28 16:01:43,347 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.11 vs. limit=22.5 2023-06-28 16:01:48,706 INFO [train.py:996] (0/4) Epoch 12, batch 13350, loss[loss=0.2183, simple_loss=0.306, pruned_loss=0.06532, over 20671.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2999, pruned_loss=0.06911, over 4288314.09 frames. ], batch size: 607, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 16:02:11,984 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2092806.0, ans=0.1 2023-06-28 16:02:19,836 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=2092806.0, ans=0.125 2023-06-28 16:02:24,931 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 16:03:03,956 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=2092926.0, ans=0.125 2023-06-28 16:03:34,264 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=2093046.0, ans=0.0 2023-06-28 16:03:35,154 INFO [train.py:996] (0/4) Epoch 12, batch 13400, loss[loss=0.2768, simple_loss=0.3328, pruned_loss=0.1105, over 21522.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.3015, pruned_loss=0.07132, over 4284133.20 frames. ], batch size: 507, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 16:03:54,873 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=18.73 vs. limit=22.5 2023-06-28 16:04:08,011 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.44 vs. limit=15.0 2023-06-28 16:04:22,597 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=2093166.0, ans=0.125 2023-06-28 16:04:39,698 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=2093226.0, ans=0.125 2023-06-28 16:04:44,339 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.284e+02 9.193e+02 1.348e+03 2.044e+03 4.158e+03, threshold=2.695e+03, percent-clipped=16.0 2023-06-28 16:05:12,981 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2093346.0, ans=0.1 2023-06-28 16:05:14,145 INFO [train.py:996] (0/4) Epoch 12, batch 13450, loss[loss=0.1897, simple_loss=0.2628, pruned_loss=0.05828, over 17220.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.3021, pruned_loss=0.07289, over 4279573.15 frames. 
], batch size: 60, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:05:28,060 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 16:06:58,132 INFO [train.py:996] (0/4) Epoch 12, batch 13500, loss[loss=0.1442, simple_loss=0.1863, pruned_loss=0.05105, over 15779.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2927, pruned_loss=0.06941, over 4276499.66 frames. ], batch size: 61, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:07:50,888 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=2093766.0, ans=0.0 2023-06-28 16:07:54,180 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=2093766.0, ans=0.0 2023-06-28 16:07:58,133 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.33 vs. limit=22.5 2023-06-28 16:08:07,379 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2093826.0, ans=0.125 2023-06-28 16:08:13,368 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.877e+02 6.964e+02 1.038e+03 1.541e+03 3.052e+03, threshold=2.076e+03, percent-clipped=2.0 2023-06-28 16:08:34,017 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=2093886.0, ans=0.0 2023-06-28 16:08:43,473 INFO [train.py:996] (0/4) Epoch 12, batch 13550, loss[loss=0.228, simple_loss=0.3299, pruned_loss=0.06305, over 21617.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2967, pruned_loss=0.06838, over 4279248.58 frames. ], batch size: 230, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:09:20,227 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2094006.0, ans=0.125 2023-06-28 16:09:31,700 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2094066.0, ans=0.1 2023-06-28 16:09:38,481 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2094066.0, ans=0.125 2023-06-28 16:10:00,806 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=2094126.0, ans=0.125 2023-06-28 16:10:12,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=2094186.0, ans=0.125 2023-06-28 16:10:26,354 INFO [train.py:996] (0/4) Epoch 12, batch 13600, loss[loss=0.2167, simple_loss=0.2957, pruned_loss=0.06879, over 21759.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2994, pruned_loss=0.06914, over 4285955.25 frames. 
], batch size: 389, lr: 2.42e-03, grad_scale: 32.0 2023-06-28 16:10:28,698 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2094246.0, ans=0.0 2023-06-28 16:11:39,038 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.839e+02 7.727e+02 1.210e+03 1.733e+03 4.112e+03, threshold=2.419e+03, percent-clipped=15.0 2023-06-28 16:11:57,680 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=2094486.0, ans=0.125 2023-06-28 16:12:13,566 INFO [train.py:996] (0/4) Epoch 12, batch 13650, loss[loss=0.1828, simple_loss=0.2523, pruned_loss=0.05667, over 21687.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2935, pruned_loss=0.06606, over 4270515.45 frames. ], batch size: 282, lr: 2.42e-03, grad_scale: 32.0 2023-06-28 16:12:24,688 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.26 vs. limit=6.0 2023-06-28 16:12:32,806 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2094546.0, ans=0.0 2023-06-28 16:12:44,747 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.33 vs. limit=15.0 2023-06-28 16:12:46,449 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 16:12:51,120 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2094666.0, ans=0.1 2023-06-28 16:13:39,182 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=2094786.0, ans=0.2 2023-06-28 16:13:57,103 INFO [train.py:996] (0/4) Epoch 12, batch 13700, loss[loss=0.217, simple_loss=0.2982, pruned_loss=0.06787, over 21646.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2866, pruned_loss=0.06567, over 4258947.65 frames. ], batch size: 414, lr: 2.42e-03, grad_scale: 8.0 2023-06-28 16:14:04,525 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=2094846.0, ans=0.05 2023-06-28 16:14:07,647 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2094846.0, ans=0.125 2023-06-28 16:14:11,624 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.69 vs. limit=15.0 2023-06-28 16:14:14,617 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=2094846.0, ans=0.0 2023-06-28 16:14:23,321 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2094906.0, ans=0.125 2023-06-28 16:15:09,459 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=2095026.0, ans=0.2 2023-06-28 16:15:09,909 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.02 vs. 
limit=15.0 2023-06-28 16:15:10,905 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2095026.0, ans=0.1 2023-06-28 16:15:15,684 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.251e+02 7.877e+02 1.121e+03 1.931e+03 5.975e+03, threshold=2.242e+03, percent-clipped=12.0 2023-06-28 16:15:44,898 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2095086.0, ans=0.0 2023-06-28 16:15:47,602 INFO [train.py:996] (0/4) Epoch 12, batch 13750, loss[loss=0.1875, simple_loss=0.2673, pruned_loss=0.05386, over 21560.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.2867, pruned_loss=0.06612, over 4261911.02 frames. ], batch size: 230, lr: 2.42e-03, grad_scale: 8.0 2023-06-28 16:16:05,992 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2095206.0, ans=0.1 2023-06-28 16:16:29,894 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2095266.0, ans=0.1 2023-06-28 16:16:42,346 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.91 vs. limit=15.0 2023-06-28 16:16:50,702 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2095326.0, ans=0.0 2023-06-28 16:16:58,192 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.12 vs. limit=15.0 2023-06-28 16:17:20,499 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.09 vs. limit=15.0 2023-06-28 16:17:23,889 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.74 vs. limit=15.0 2023-06-28 16:17:34,236 INFO [train.py:996] (0/4) Epoch 12, batch 13800, loss[loss=0.1838, simple_loss=0.2663, pruned_loss=0.0507, over 21077.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.2912, pruned_loss=0.06425, over 4262759.46 frames. ], batch size: 143, lr: 2.42e-03, grad_scale: 8.0 2023-06-28 16:18:56,443 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.457e+02 7.505e+02 1.008e+03 1.759e+03 5.617e+03, threshold=2.016e+03, percent-clipped=13.0 2023-06-28 16:19:01,129 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.63 vs. limit=6.0 2023-06-28 16:19:18,270 INFO [train.py:996] (0/4) Epoch 12, batch 13850, loss[loss=0.2427, simple_loss=0.3239, pruned_loss=0.08079, over 21584.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2957, pruned_loss=0.06405, over 4268310.74 frames. 
], batch size: 230, lr: 2.42e-03, grad_scale: 8.0 2023-06-28 16:19:19,105 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=2095746.0, ans=0.0 2023-06-28 16:20:14,711 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2095866.0, ans=0.125 2023-06-28 16:20:33,022 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=2095926.0, ans=0.2 2023-06-28 16:20:42,867 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=2095986.0, ans=0.2 2023-06-28 16:21:05,031 INFO [train.py:996] (0/4) Epoch 12, batch 13900, loss[loss=0.2482, simple_loss=0.3139, pruned_loss=0.09128, over 21550.00 frames. ], tot_loss[loss=0.218, simple_loss=0.3005, pruned_loss=0.06777, over 4273636.10 frames. ], batch size: 471, lr: 2.42e-03, grad_scale: 8.0 2023-06-28 16:22:20,627 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.853e+02 9.390e+02 1.248e+03 1.935e+03 5.140e+03, threshold=2.497e+03, percent-clipped=23.0 2023-06-28 16:22:34,853 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=2096286.0, ans=0.07 2023-06-28 16:22:47,371 INFO [train.py:996] (0/4) Epoch 12, batch 13950, loss[loss=0.2119, simple_loss=0.2902, pruned_loss=0.06677, over 21781.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2998, pruned_loss=0.06961, over 4283521.91 frames. ], batch size: 282, lr: 2.42e-03, grad_scale: 8.0 2023-06-28 16:22:47,904 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2096346.0, ans=0.1 2023-06-28 16:24:17,264 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=2096586.0, ans=0.0 2023-06-28 16:24:23,128 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.37 vs. limit=10.0 2023-06-28 16:24:25,031 INFO [train.py:996] (0/4) Epoch 12, batch 14000, loss[loss=0.172, simple_loss=0.2579, pruned_loss=0.043, over 21784.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2962, pruned_loss=0.06783, over 4280702.00 frames. ], batch size: 351, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:25:01,680 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=2096706.0, ans=0.125 2023-06-28 16:25:45,215 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.785e+02 7.339e+02 1.044e+03 1.507e+03 3.234e+03, threshold=2.088e+03, percent-clipped=5.0 2023-06-28 16:26:00,877 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.21 vs. limit=10.0 2023-06-28 16:26:11,282 INFO [train.py:996] (0/4) Epoch 12, batch 14050, loss[loss=0.1807, simple_loss=0.2565, pruned_loss=0.0525, over 21702.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2912, pruned_loss=0.06449, over 4276250.79 frames. 
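The [optim.py:471] lines summarise recent gradient norms: five quantiles, a clipping threshold, and the percentage of recent batches that were clipped. In every entry here the threshold is the middle quantile times Clipping_scale (for example 2.0 x 1.044e+03 = 2.088e+03), so the clip limit appears to be derived from a running median of gradient norms. A sketch of how such statistics could be kept is below; the class and method names are illustrative, not the actual optim.py implementation.

    from collections import deque
    import numpy as np

    class GradNormClipStats:
        """Track recent gradient norms and derive a clipping threshold from them."""

        def __init__(self, clipping_scale: float = 2.0, window: int = 128):
            self.clipping_scale = clipping_scale
            self.norms = deque(maxlen=window)   # recent total gradient norms
            self.num_clipped = 0
            self.num_seen = 0

        def observe(self, grad_norm: float) -> float:
            """Record one batch's gradient norm; return the clip coefficient (<= 1.0)."""
            self.norms.append(grad_norm)
            threshold = self.clipping_scale * float(np.median(self.norms))
            self.num_seen += 1
            if grad_norm > threshold:
                self.num_clipped += 1
            return min(1.0, threshold / max(grad_norm, 1e-20))

        def report(self) -> str:
            qs = np.quantile(np.array(self.norms), [0.0, 0.25, 0.5, 0.75, 1.0])
            threshold = self.clipping_scale * qs[2]
            pct = 100.0 * self.num_clipped / max(self.num_seen, 1)
            return ("Clipping_scale=%s, grad-norm quartiles " % self.clipping_scale
                    + " ".join("%.3e" % q for q in qs)
                    + ", threshold=%.3e, percent-clipped=%.1f" % (threshold, pct))

    # Feeding in the five quartile values from one log entry reproduces its format.
    stats = GradNormClipStats()
    for norm in (484.0, 772.0, 1210.0, 1733.0, 4112.0):
        stats.observe(norm)
    print(stats.report())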
], batch size: 282, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:26:31,234 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2097006.0, ans=0.0 2023-06-28 16:26:57,396 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=2097066.0, ans=0.2 2023-06-28 16:27:32,808 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.31 vs. limit=15.0 2023-06-28 16:27:38,711 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=2097186.0, ans=0.0 2023-06-28 16:27:52,887 INFO [train.py:996] (0/4) Epoch 12, batch 14100, loss[loss=0.1901, simple_loss=0.2562, pruned_loss=0.06203, over 21831.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.2849, pruned_loss=0.06425, over 4274476.20 frames. ], batch size: 98, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:28:03,346 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2097246.0, ans=0.0 2023-06-28 16:28:41,052 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=2097366.0, ans=0.04949747468305833 2023-06-28 16:28:57,632 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=2097426.0, ans=0.05 2023-06-28 16:29:02,413 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2097426.0, ans=0.1 2023-06-28 16:29:08,547 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.981e+02 8.924e+02 1.261e+03 1.864e+03 4.328e+03, threshold=2.523e+03, percent-clipped=18.0 2023-06-28 16:29:10,822 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=2097426.0, ans=0.0 2023-06-28 16:29:28,357 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2097546.0, ans=0.0 2023-06-28 16:29:29,311 INFO [train.py:996] (0/4) Epoch 12, batch 14150, loss[loss=0.2145, simple_loss=0.3, pruned_loss=0.06446, over 21359.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2887, pruned_loss=0.06487, over 4258160.55 frames. 
], batch size: 159, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:29:38,867 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2097546.0, ans=0.1 2023-06-28 16:29:50,147 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2097606.0, ans=0.2 2023-06-28 16:29:58,527 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=2097606.0, ans=0.0 2023-06-28 16:30:36,001 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2097726.0, ans=0.1 2023-06-28 16:30:48,857 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=2097726.0, ans=0.0 2023-06-28 16:30:48,964 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=2097726.0, ans=0.0 2023-06-28 16:31:01,942 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2097786.0, ans=0.125 2023-06-28 16:31:07,791 INFO [train.py:996] (0/4) Epoch 12, batch 14200, loss[loss=0.1968, simple_loss=0.2682, pruned_loss=0.06273, over 21634.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2894, pruned_loss=0.06406, over 4257108.33 frames. ], batch size: 230, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:31:17,158 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.31 vs. limit=15.0 2023-06-28 16:31:21,144 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 16:32:16,912 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=2098026.0, ans=0.2 2023-06-28 16:32:21,329 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.921e+02 6.799e+02 8.924e+02 1.241e+03 3.377e+03, threshold=1.785e+03, percent-clipped=4.0 2023-06-28 16:32:40,180 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2098086.0, ans=0.0 2023-06-28 16:32:47,806 INFO [train.py:996] (0/4) Epoch 12, batch 14250, loss[loss=0.1743, simple_loss=0.2396, pruned_loss=0.05455, over 21480.00 frames. ], tot_loss[loss=0.2061, simple_loss=0.284, pruned_loss=0.06413, over 4266449.80 frames. ], batch size: 195, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:32:48,611 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2098146.0, ans=0.125 2023-06-28 16:32:50,000 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=2098146.0, ans=0.125 2023-06-28 16:33:51,975 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2098266.0, ans=0.125 2023-06-28 16:34:18,679 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.78 vs. limit=22.5 2023-06-28 16:34:34,315 INFO [train.py:996] (0/4) Epoch 12, batch 14300, loss[loss=0.2153, simple_loss=0.3309, pruned_loss=0.04986, over 19644.00 frames. ], tot_loss[loss=0.2058, simple_loss=0.286, pruned_loss=0.06281, over 4263195.84 frames. 
], batch size: 702, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:34:51,762 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=2098446.0, ans=0.05 2023-06-28 16:34:53,541 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2098446.0, ans=0.125 2023-06-28 16:35:06,894 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2098506.0, ans=0.125 2023-06-28 16:35:31,648 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2098566.0, ans=0.125 2023-06-28 16:35:55,445 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.012e+02 8.237e+02 1.255e+03 2.124e+03 4.385e+03, threshold=2.511e+03, percent-clipped=34.0 2023-06-28 16:36:17,109 INFO [train.py:996] (0/4) Epoch 12, batch 14350, loss[loss=0.2248, simple_loss=0.3112, pruned_loss=0.06917, over 21740.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.2928, pruned_loss=0.0631, over 4254433.64 frames. ], batch size: 441, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:36:22,814 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2098746.0, ans=0.1 2023-06-28 16:37:14,578 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.63 vs. limit=15.0 2023-06-28 16:37:58,189 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2099046.0, ans=0.0 2023-06-28 16:37:59,252 INFO [train.py:996] (0/4) Epoch 12, batch 14400, loss[loss=0.1901, simple_loss=0.2639, pruned_loss=0.05818, over 21480.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.29, pruned_loss=0.06381, over 4263496.16 frames. ], batch size: 195, lr: 2.42e-03, grad_scale: 32.0 2023-06-28 16:39:15,431 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=2099226.0, ans=0.125 2023-06-28 16:39:18,229 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.847e+02 6.927e+02 1.038e+03 1.645e+03 3.908e+03, threshold=2.076e+03, percent-clipped=8.0 2023-06-28 16:39:18,857 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=2099226.0, ans=0.0 2023-06-28 16:39:19,324 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.86 vs. limit=10.0 2023-06-28 16:39:39,678 INFO [train.py:996] (0/4) Epoch 12, batch 14450, loss[loss=0.2036, simple_loss=0.2699, pruned_loss=0.06869, over 21289.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.2836, pruned_loss=0.06412, over 4263221.49 frames. ], batch size: 177, lr: 2.42e-03, grad_scale: 32.0 2023-06-28 16:40:37,371 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=2099466.0, ans=0.2 2023-06-28 16:40:50,678 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2099526.0, ans=0.0 2023-06-28 16:41:23,318 INFO [train.py:996] (0/4) Epoch 12, batch 14500, loss[loss=0.1877, simple_loss=0.2791, pruned_loss=0.04817, over 21583.00 frames. ], tot_loss[loss=0.2041, simple_loss=0.2802, pruned_loss=0.06396, over 4267207.41 frames. 
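The [scaling.py:962] Whitening lines report a per-module statistic against a limit (for example metric=4.63 vs. limit=15.0); the metric is a measure of how far the channel covariance of an activation is from a scaled identity. One simple whiteness measure with that behaviour is trace(C^2) * d / trace(C)^2, which equals 1.0 when the d-dimensional covariance C is a multiple of the identity and grows as the eigenvalue spread grows. The sketch below uses that formula purely as an illustration; the exact metric computed in scaling.py may differ.

    import torch

    def whiteness_metric(x: torch.Tensor) -> float:
        """x: (num_frames, num_channels) activations.

        Returns 1.0 when the channel covariance is a multiple of the identity, and a
        larger value the less 'white' the features are.  Illustrative formula only."""
        x = x - x.mean(dim=0, keepdim=True)
        num_channels = x.shape[1]
        cov = (x.t() @ x) / x.shape[0]                  # (C, C) channel covariance
        num = torch.trace(cov @ cov) * num_channels     # trace(C^2) * d
        den = torch.trace(cov) ** 2 + 1e-20             # trace(C)^2
        return (num / den).item()

    # White noise gives a value near 1; strongly correlated channels give a much
    # larger value, the kind of situation the "metric ... vs. limit ..." lines flag.
    print(whiteness_metric(torch.randn(1000, 256)))
    print(whiteness_metric(torch.randn(1000, 1).repeat(1, 256) + 0.01 * torch.randn(1000, 256)))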
], batch size: 263, lr: 2.42e-03, grad_scale: 32.0 2023-06-28 16:42:02,330 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.05 vs. limit=15.0 2023-06-28 16:42:21,574 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=2099766.0, ans=0.125 2023-06-28 16:42:24,891 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2099766.0, ans=0.125 2023-06-28 16:42:46,533 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.162e+02 7.722e+02 1.013e+03 1.611e+03 2.945e+03, threshold=2.026e+03, percent-clipped=11.0 2023-06-28 16:42:48,812 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2099886.0, ans=0.125 2023-06-28 16:42:55,447 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2099886.0, ans=0.0 2023-06-28 16:43:05,984 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2099946.0, ans=0.125 2023-06-28 16:43:11,635 INFO [train.py:996] (0/4) Epoch 12, batch 14550, loss[loss=0.2408, simple_loss=0.3185, pruned_loss=0.08153, over 21689.00 frames. ], tot_loss[loss=0.2064, simple_loss=0.2835, pruned_loss=0.06462, over 4265933.29 frames. ], batch size: 351, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:43:25,276 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=2099946.0, ans=0.0 2023-06-28 16:43:52,448 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.56 vs. limit=15.0 2023-06-28 16:44:22,635 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 16:44:24,990 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.72 vs. limit=5.0 2023-06-28 16:44:27,991 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=2100126.0, ans=0.125 2023-06-28 16:44:40,526 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.35 vs. limit=15.0 2023-06-28 16:44:58,825 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=2100246.0, ans=0.0 2023-06-28 16:44:59,764 INFO [train.py:996] (0/4) Epoch 12, batch 14600, loss[loss=0.2403, simple_loss=0.3258, pruned_loss=0.0774, over 21424.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2911, pruned_loss=0.06784, over 4264093.82 frames. ], batch size: 131, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:45:07,563 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.56 vs. 
limit=12.0 2023-06-28 16:45:51,051 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=2100366.0, ans=10.0 2023-06-28 16:46:12,062 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.268e+02 8.624e+02 1.300e+03 2.155e+03 4.412e+03, threshold=2.599e+03, percent-clipped=26.0 2023-06-28 16:46:41,592 INFO [train.py:996] (0/4) Epoch 12, batch 14650, loss[loss=0.1582, simple_loss=0.2508, pruned_loss=0.03276, over 21613.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2947, pruned_loss=0.06766, over 4261844.03 frames. ], batch size: 230, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:46:50,480 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2100546.0, ans=0.125 2023-06-28 16:47:18,379 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=2100666.0, ans=0.2 2023-06-28 16:47:38,374 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2100666.0, ans=0.0 2023-06-28 16:47:46,905 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2100726.0, ans=0.125 2023-06-28 16:47:51,724 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2100726.0, ans=0.2 2023-06-28 16:48:15,511 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=2100786.0, ans=0.2 2023-06-28 16:48:24,619 INFO [train.py:996] (0/4) Epoch 12, batch 14700, loss[loss=0.1882, simple_loss=0.2798, pruned_loss=0.04827, over 21582.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.2903, pruned_loss=0.06314, over 4269497.42 frames. ], batch size: 263, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:48:40,533 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2100846.0, ans=0.1 2023-06-28 16:49:00,678 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2100906.0, ans=0.125 2023-06-28 16:49:22,441 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.34 vs. limit=15.0 2023-06-28 16:49:40,148 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.485e+02 7.512e+02 1.036e+03 1.553e+03 3.154e+03, threshold=2.072e+03, percent-clipped=4.0 2023-06-28 16:50:12,561 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=2101086.0, ans=0.125 2023-06-28 16:50:15,513 INFO [train.py:996] (0/4) Epoch 12, batch 14750, loss[loss=0.3261, simple_loss=0.3988, pruned_loss=0.1267, over 21721.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2966, pruned_loss=0.06585, over 4266078.45 frames. 
], batch size: 441, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:50:45,503 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=2101206.0, ans=0.125 2023-06-28 16:51:02,205 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=2101266.0, ans=0.0 2023-06-28 16:51:17,534 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2101326.0, ans=0.125 2023-06-28 16:51:46,180 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=2101386.0, ans=0.125 2023-06-28 16:51:58,901 INFO [train.py:996] (0/4) Epoch 12, batch 14800, loss[loss=0.2214, simple_loss=0.2999, pruned_loss=0.07144, over 21256.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.308, pruned_loss=0.07072, over 4270553.49 frames. ], batch size: 549, lr: 2.42e-03, grad_scale: 32.0 2023-06-28 16:52:21,503 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2101506.0, ans=0.1 2023-06-28 16:53:21,956 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=2101626.0, ans=0.5 2023-06-28 16:53:24,531 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.184e+02 7.719e+02 1.255e+03 2.135e+03 5.182e+03, threshold=2.510e+03, percent-clipped=29.0 2023-06-28 16:53:25,094 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2101626.0, ans=0.1 2023-06-28 16:53:50,385 INFO [train.py:996] (0/4) Epoch 12, batch 14850, loss[loss=0.182, simple_loss=0.2541, pruned_loss=0.05492, over 21868.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.3001, pruned_loss=0.07002, over 4268195.87 frames. ], batch size: 107, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:54:02,808 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 16:54:06,210 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=2101806.0, ans=0.125 2023-06-28 16:54:37,309 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.55 vs. limit=5.0 2023-06-28 16:55:34,859 INFO [train.py:996] (0/4) Epoch 12, batch 14900, loss[loss=0.2071, simple_loss=0.2872, pruned_loss=0.0635, over 21428.00 frames. ], tot_loss[loss=0.221, simple_loss=0.3004, pruned_loss=0.07084, over 4271079.01 frames. ], batch size: 211, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:55:38,718 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2102046.0, ans=0.125 2023-06-28 16:55:54,916 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.70 vs. limit=15.0 2023-06-28 16:56:20,698 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=2102166.0, ans=0.05 2023-06-28 16:56:24,614 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.24 vs. 
limit=22.5 2023-06-28 16:56:27,525 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2102166.0, ans=0.1 2023-06-28 16:56:44,304 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=2102226.0, ans=0.125 2023-06-28 16:56:55,263 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.996e+02 9.079e+02 1.317e+03 1.882e+03 4.138e+03, threshold=2.634e+03, percent-clipped=10.0 2023-06-28 16:57:13,153 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=2102346.0, ans=0.125 2023-06-28 16:57:14,113 INFO [train.py:996] (0/4) Epoch 12, batch 14950, loss[loss=0.2008, simple_loss=0.2855, pruned_loss=0.05805, over 21248.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.3014, pruned_loss=0.07059, over 4268089.76 frames. ], batch size: 176, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:58:11,794 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.21 vs. limit=15.0 2023-06-28 16:58:27,919 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2102526.0, ans=0.0 2023-06-28 16:58:48,353 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2102586.0, ans=0.125 2023-06-28 16:58:52,430 INFO [train.py:996] (0/4) Epoch 12, batch 15000, loss[loss=0.2263, simple_loss=0.297, pruned_loss=0.07783, over 21490.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.3022, pruned_loss=0.07083, over 4272746.41 frames. ], batch size: 548, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:58:52,431 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-28 16:59:06,760 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.5.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([2.2545, 2.1299, 4.0135, 3.7081], device='cuda:0') 2023-06-28 16:59:11,975 INFO [train.py:1028] (0/4) Epoch 12, validation: loss=0.2573, simple_loss=0.3458, pruned_loss=0.08437, over 1796401.00 frames. 2023-06-28 16:59:11,976 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23714MB 2023-06-28 16:59:35,075 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.45 vs. limit=15.0 2023-06-28 16:59:57,373 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2102766.0, ans=0.1 2023-06-28 17:00:07,248 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2102766.0, ans=0.1 2023-06-28 17:00:07,295 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2102766.0, ans=0.0 2023-06-28 17:00:08,707 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2102766.0, ans=0.0 2023-06-28 17:00:28,472 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.412e+02 7.471e+02 1.040e+03 1.542e+03 3.461e+03, threshold=2.079e+03, percent-clipped=2.0 2023-06-28 17:00:57,530 INFO [train.py:996] (0/4) Epoch 12, batch 15050, loss[loss=0.2121, simple_loss=0.2924, pruned_loss=0.0659, over 21295.00 frames. 
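At Epoch 12, batch 15000 the log switches to "Computing validation loss", reports a validation loss averaged over 1796401 frames, and notes the maximum memory allocated so far (presumably via torch.cuda.max_memory_allocated()). The validation pass is essentially the training criterion evaluated without gradients and averaged over frames. The sketch below assumes a compute_loss helper that returns a frame-summed loss and a frame count; that signature is illustrative rather than the literal code in train.py.

    import torch

    def validate(model, valid_loader, compute_loss, device="cuda:0") -> float:
        """Frame-weighted validation loss computed with gradients disabled.

        compute_loss(model, batch, device) is assumed to return
        (loss summed over the batch's frames, number of frames); this signature is
        illustrative, not the exact helper used by train.py."""
        was_training = model.training
        model.eval()
        tot_loss, tot_frames = 0.0, 0.0
        with torch.no_grad():
            for batch in valid_loader:
                loss, num_frames = compute_loss(model, batch, device)
                tot_loss += float(loss)
                tot_frames += float(num_frames)
        if was_training:
            model.train()
        return tot_loss / max(tot_frames, 1.0)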
], tot_loss[loss=0.2251, simple_loss=0.3046, pruned_loss=0.07279, over 4273363.18 frames. ], batch size: 176, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:02:25,872 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=2103186.0, ans=0.0 2023-06-28 17:02:45,499 INFO [train.py:996] (0/4) Epoch 12, batch 15100, loss[loss=0.231, simple_loss=0.3295, pruned_loss=0.06631, over 20688.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.3063, pruned_loss=0.07228, over 4272776.50 frames. ], batch size: 608, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:02:57,724 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2103246.0, ans=0.0 2023-06-28 17:03:39,824 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.24 vs. limit=12.0 2023-06-28 17:03:49,647 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.39 vs. limit=22.5 2023-06-28 17:04:04,843 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.134e+02 7.864e+02 1.140e+03 1.681e+03 3.504e+03, threshold=2.280e+03, percent-clipped=13.0 2023-06-28 17:04:21,706 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=2103486.0, ans=0.2 2023-06-28 17:04:27,710 INFO [train.py:996] (0/4) Epoch 12, batch 15150, loss[loss=0.1849, simple_loss=0.2624, pruned_loss=0.05374, over 21740.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.303, pruned_loss=0.07302, over 4275667.12 frames. ], batch size: 102, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:05:36,640 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=2103726.0, ans=0.125 2023-06-28 17:06:10,677 INFO [train.py:996] (0/4) Epoch 12, batch 15200, loss[loss=0.1718, simple_loss=0.2767, pruned_loss=0.03345, over 20872.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2934, pruned_loss=0.06986, over 4273188.49 frames. ], batch size: 609, lr: 2.42e-03, grad_scale: 32.0 2023-06-28 17:06:11,963 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=10.66 vs. limit=10.0 2023-06-28 17:06:32,926 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2103906.0, ans=0.0 2023-06-28 17:07:34,246 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.768e+02 7.003e+02 9.713e+02 1.349e+03 2.577e+03, threshold=1.943e+03, percent-clipped=4.0 2023-06-28 17:07:52,327 INFO [train.py:996] (0/4) Epoch 12, batch 15250, loss[loss=0.2276, simple_loss=0.2995, pruned_loss=0.0779, over 21244.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2871, pruned_loss=0.06806, over 4262651.94 frames. 
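Most of the [scaling.py:182] entries are ScheduledFloat values: module hyper-parameters such as dropout probabilities, skip rates, and balancer probabilities whose current value (ans) is looked up from the global batch_count. A minimal piecewise-linear schedule keyed on batch count is sketched below; the class name and breakpoints are illustrative, and the real ScheduledFloat in scaling.py carries more machinery.

    class PiecewiseLinearSchedule:
        """A float that interpolates linearly between (batch_count, value) breakpoints
        and stays constant outside them; a simplified stand-in for ScheduledFloat."""

        def __init__(self, *points):
            self.points = sorted(points)          # e.g. ((0.0, 0.3), (20000.0, 0.1))

        def __call__(self, batch_count: float) -> float:
            pts = self.points
            if batch_count <= pts[0][0]:
                return pts[0][1]
            if batch_count >= pts[-1][0]:
                return pts[-1][1]
            for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
                if x0 <= batch_count <= x1:
                    t = (batch_count - x0) / (x1 - x0)
                    return y0 + t * (y1 - y0)
            return pts[-1][1]

    # A dropout_p decaying from 0.3 to 0.1 over the first 20000 batches would read
    # ans=0.1 at the batch counts in this part of the log (around 2.1 million).
    dropout_p = PiecewiseLinearSchedule((0.0, 0.3), (20000.0, 0.1))
    print(dropout_p(2103186.0))   # -> 0.1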
], batch size: 159, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:08:01,172 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=2104146.0, ans=0.0 2023-06-28 17:08:26,092 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2104206.0, ans=0.1 2023-06-28 17:09:13,293 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=2104326.0, ans=0.125 2023-06-28 17:09:29,811 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2104386.0, ans=0.125 2023-06-28 17:09:34,114 INFO [train.py:996] (0/4) Epoch 12, batch 15300, loss[loss=0.2461, simple_loss=0.3373, pruned_loss=0.07742, over 17005.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.29, pruned_loss=0.07066, over 4263772.57 frames. ], batch size: 60, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:10:40,550 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2104626.0, ans=0.125 2023-06-28 17:10:50,116 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=2104626.0, ans=0.0 2023-06-28 17:10:51,489 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2104626.0, ans=0.125 2023-06-28 17:11:01,322 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.828e+02 9.652e+02 1.202e+03 1.838e+03 3.602e+03, threshold=2.404e+03, percent-clipped=24.0 2023-06-28 17:11:17,469 INFO [train.py:996] (0/4) Epoch 12, batch 15350, loss[loss=0.2074, simple_loss=0.3076, pruned_loss=0.05359, over 21844.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.296, pruned_loss=0.07274, over 4267462.95 frames. ], batch size: 316, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:11:21,721 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.07 vs. limit=10.0 2023-06-28 17:12:11,169 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2104866.0, ans=0.125 2023-06-28 17:12:19,210 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2104926.0, ans=0.125 2023-06-28 17:12:56,052 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 17:12:56,094 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2105046.0, ans=0.0 2023-06-28 17:12:56,691 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=2105046.0, ans=10.0 2023-06-28 17:12:56,971 INFO [train.py:996] (0/4) Epoch 12, batch 15400, loss[loss=0.2054, simple_loss=0.2912, pruned_loss=0.05974, over 21806.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2979, pruned_loss=0.07141, over 4262498.62 frames. 
], batch size: 332, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:12:57,465 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2105046.0, ans=0.0 2023-06-28 17:13:15,091 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.53 vs. limit=15.0 2023-06-28 17:14:16,129 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.384e+02 7.580e+02 1.010e+03 1.519e+03 4.001e+03, threshold=2.021e+03, percent-clipped=6.0 2023-06-28 17:14:37,992 INFO [train.py:996] (0/4) Epoch 12, batch 15450, loss[loss=0.2048, simple_loss=0.2876, pruned_loss=0.06096, over 21770.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2953, pruned_loss=0.07018, over 4249915.58 frames. ], batch size: 389, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:14:52,204 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=2105346.0, ans=0.125 2023-06-28 17:14:52,213 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2105346.0, ans=0.125 2023-06-28 17:15:15,282 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2105466.0, ans=0.125 2023-06-28 17:15:56,642 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.96 vs. limit=10.0 2023-06-28 17:16:20,924 INFO [train.py:996] (0/4) Epoch 12, batch 15500, loss[loss=0.2577, simple_loss=0.3386, pruned_loss=0.08843, over 21769.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2981, pruned_loss=0.0698, over 4259662.32 frames. ], batch size: 124, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:16:35,294 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.72 vs. limit=15.0 2023-06-28 17:17:14,325 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=2105766.0, ans=0.95 2023-06-28 17:17:26,013 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=2105826.0, ans=0.0 2023-06-28 17:17:27,455 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2105826.0, ans=0.125 2023-06-28 17:17:46,455 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.645e+02 8.205e+02 1.251e+03 1.746e+03 3.424e+03, threshold=2.502e+03, percent-clipped=13.0 2023-06-28 17:17:48,822 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=2105886.0, ans=0.0 2023-06-28 17:18:07,377 INFO [train.py:996] (0/4) Epoch 12, batch 15550, loss[loss=0.1816, simple_loss=0.2669, pruned_loss=0.04815, over 21728.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2977, pruned_loss=0.0683, over 4260733.51 frames. ], batch size: 124, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:18:16,900 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.61 vs. 
limit=15.0 2023-06-28 17:18:36,610 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2106006.0, ans=0.125 2023-06-28 17:18:49,959 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=2106066.0, ans=0.2 2023-06-28 17:18:54,959 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2106066.0, ans=0.125 2023-06-28 17:19:50,320 INFO [train.py:996] (0/4) Epoch 12, batch 15600, loss[loss=0.1802, simple_loss=0.2493, pruned_loss=0.05562, over 21666.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.2911, pruned_loss=0.06621, over 4254391.88 frames. ], batch size: 282, lr: 2.42e-03, grad_scale: 32.0 2023-06-28 17:20:44,270 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.13 vs. limit=15.0 2023-06-28 17:21:08,512 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.963e+02 9.239e+02 1.318e+03 1.838e+03 4.350e+03, threshold=2.636e+03, percent-clipped=8.0 2023-06-28 17:21:29,556 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.62 vs. limit=8.0 2023-06-28 17:21:29,749 INFO [train.py:996] (0/4) Epoch 12, batch 15650, loss[loss=0.1846, simple_loss=0.2649, pruned_loss=0.05218, over 21624.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2905, pruned_loss=0.06539, over 4262321.20 frames. ], batch size: 298, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:21:35,410 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=2106546.0, ans=0.05 2023-06-28 17:22:02,965 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2106606.0, ans=0.125 2023-06-28 17:22:14,759 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 17:23:00,601 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.87 vs. limit=15.0 2023-06-28 17:23:12,650 INFO [train.py:996] (0/4) Epoch 12, batch 15700, loss[loss=0.2192, simple_loss=0.2894, pruned_loss=0.07444, over 21462.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.287, pruned_loss=0.0637, over 4254762.80 frames. ], batch size: 441, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:23:15,518 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.26 vs. limit=15.0 2023-06-28 17:23:56,538 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2106966.0, ans=0.125 2023-06-28 17:23:58,202 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2106966.0, ans=0.125 2023-06-28 17:24:39,909 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.792e+02 8.496e+02 1.514e+03 2.181e+03 4.345e+03, threshold=3.028e+03, percent-clipped=16.0 2023-06-28 17:24:54,652 INFO [train.py:996] (0/4) Epoch 12, batch 15750, loss[loss=0.1908, simple_loss=0.2598, pruned_loss=0.0609, over 21397.00 frames. ], tot_loss[loss=0.206, simple_loss=0.2843, pruned_loss=0.06385, over 4258477.27 frames. 
], batch size: 194, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:24:58,177 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2107146.0, ans=0.1 2023-06-28 17:25:32,111 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=2107206.0, ans=0.2 2023-06-28 17:25:42,145 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2107266.0, ans=0.0 2023-06-28 17:26:14,991 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2107386.0, ans=0.125 2023-06-28 17:26:31,183 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2107386.0, ans=0.125 2023-06-28 17:26:35,204 INFO [train.py:996] (0/4) Epoch 12, batch 15800, loss[loss=0.1957, simple_loss=0.2635, pruned_loss=0.06394, over 21704.00 frames. ], tot_loss[loss=0.2025, simple_loss=0.2781, pruned_loss=0.06344, over 4255831.86 frames. ], batch size: 282, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:28:01,453 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.867e+02 7.167e+02 8.955e+02 1.687e+03 3.256e+03, threshold=1.791e+03, percent-clipped=1.0 2023-06-28 17:28:16,333 INFO [train.py:996] (0/4) Epoch 12, batch 15850, loss[loss=0.2673, simple_loss=0.3273, pruned_loss=0.1036, over 21425.00 frames. ], tot_loss[loss=0.2057, simple_loss=0.2797, pruned_loss=0.06583, over 4257560.05 frames. ], batch size: 471, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:29:11,591 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2107866.0, ans=0.0 2023-06-28 17:29:47,347 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 17:29:53,387 INFO [train.py:996] (0/4) Epoch 12, batch 15900, loss[loss=0.1973, simple_loss=0.2676, pruned_loss=0.06352, over 21196.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.2785, pruned_loss=0.06663, over 4256680.42 frames. ], batch size: 176, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:30:37,079 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=2108166.0, ans=0.125 2023-06-28 17:30:43,789 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2108166.0, ans=0.1 2023-06-28 17:30:58,938 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.02 vs. limit=10.0 2023-06-28 17:31:15,601 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.841e+02 7.381e+02 9.815e+02 1.486e+03 2.540e+03, threshold=1.963e+03, percent-clipped=11.0 2023-06-28 17:31:34,456 INFO [train.py:996] (0/4) Epoch 12, batch 15950, loss[loss=0.2115, simple_loss=0.3029, pruned_loss=0.06006, over 21611.00 frames. ], tot_loss[loss=0.2054, simple_loss=0.2811, pruned_loss=0.06483, over 4258635.10 frames. 
], batch size: 263, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:31:46,816 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2108346.0, ans=0.0 2023-06-28 17:32:10,926 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2108406.0, ans=0.125 2023-06-28 17:32:33,769 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2108466.0, ans=0.125 2023-06-28 17:32:39,337 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.30 vs. limit=15.0 2023-06-28 17:33:11,196 INFO [train.py:996] (0/4) Epoch 12, batch 16000, loss[loss=0.2007, simple_loss=0.2973, pruned_loss=0.0521, over 21806.00 frames. ], tot_loss[loss=0.2026, simple_loss=0.2811, pruned_loss=0.0621, over 4264014.54 frames. ], batch size: 351, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:34:14,140 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=2108826.0, ans=0.0 2023-06-28 17:34:23,727 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=2108826.0, ans=0.2 2023-06-28 17:34:30,325 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2108826.0, ans=0.125 2023-06-28 17:34:33,362 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2108826.0, ans=0.0 2023-06-28 17:34:39,584 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.641e+02 6.614e+02 9.934e+02 1.443e+03 3.349e+03, threshold=1.987e+03, percent-clipped=8.0 2023-06-28 17:34:40,841 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.67 vs. limit=15.0 2023-06-28 17:34:52,876 INFO [train.py:996] (0/4) Epoch 12, batch 16050, loss[loss=0.1549, simple_loss=0.2356, pruned_loss=0.03711, over 21818.00 frames. ], tot_loss[loss=0.2015, simple_loss=0.2827, pruned_loss=0.06018, over 4263361.03 frames. ], batch size: 102, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:35:03,266 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2108946.0, ans=0.125 2023-06-28 17:35:08,022 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 17:35:35,492 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2109066.0, ans=0.1 2023-06-28 17:35:44,960 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=2109066.0, ans=0.125 2023-06-28 17:35:53,353 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2109126.0, ans=0.125 2023-06-28 17:36:28,137 INFO [train.py:996] (0/4) Epoch 12, batch 16100, loss[loss=0.1975, simple_loss=0.2748, pruned_loss=0.0601, over 21851.00 frames. ], tot_loss[loss=0.2043, simple_loss=0.2871, pruned_loss=0.0608, over 4271800.96 frames. 
], batch size: 298, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:37:52,811 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.516e+02 1.028e+03 1.550e+03 2.496e+03 6.023e+03, threshold=3.100e+03, percent-clipped=39.0 2023-06-28 17:38:06,337 INFO [train.py:996] (0/4) Epoch 12, batch 16150, loss[loss=0.2309, simple_loss=0.3183, pruned_loss=0.0718, over 17499.00 frames. ], tot_loss[loss=0.2061, simple_loss=0.2862, pruned_loss=0.06294, over 4272516.73 frames. ], batch size: 60, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:38:37,318 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=2109606.0, ans=0.125 2023-06-28 17:39:18,939 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=2109726.0, ans=0.0 2023-06-28 17:39:24,087 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=2109726.0, ans=0.0 2023-06-28 17:39:29,524 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.13 vs. limit=15.0 2023-06-28 17:39:30,654 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2109786.0, ans=0.125 2023-06-28 17:39:36,421 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.18 vs. limit=15.0 2023-06-28 17:39:49,897 INFO [train.py:996] (0/4) Epoch 12, batch 16200, loss[loss=0.2366, simple_loss=0.3213, pruned_loss=0.07595, over 21773.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2931, pruned_loss=0.065, over 4276803.92 frames. ], batch size: 332, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:40:26,175 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2109906.0, ans=0.125 2023-06-28 17:40:35,292 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.79 vs. limit=15.0 2023-06-28 17:40:49,511 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2109966.0, ans=0.0 2023-06-28 17:41:06,484 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=2110026.0, ans=0.0 2023-06-28 17:41:21,145 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.167e+02 9.228e+02 1.472e+03 2.186e+03 5.217e+03, threshold=2.943e+03, percent-clipped=8.0 2023-06-28 17:41:25,456 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=2110086.0, ans=0.0 2023-06-28 17:41:39,807 INFO [train.py:996] (0/4) Epoch 12, batch 16250, loss[loss=0.1688, simple_loss=0.2504, pruned_loss=0.04357, over 21525.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2933, pruned_loss=0.06503, over 4269089.80 frames. 
], batch size: 230, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:42:47,229 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=2110326.0, ans=0.2 2023-06-28 17:42:47,245 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=2110326.0, ans=0.2 2023-06-28 17:43:09,995 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=2110386.0, ans=0.125 2023-06-28 17:43:22,710 INFO [train.py:996] (0/4) Epoch 12, batch 16300, loss[loss=0.1674, simple_loss=0.2576, pruned_loss=0.03865, over 21772.00 frames. ], tot_loss[loss=0.2053, simple_loss=0.287, pruned_loss=0.06182, over 4260518.85 frames. ], batch size: 282, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:43:40,381 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.99 vs. limit=15.0 2023-06-28 17:43:51,825 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 17:44:12,647 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=2110566.0, ans=0.2 2023-06-28 17:44:22,930 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2110626.0, ans=0.125 2023-06-28 17:44:39,500 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=2110626.0, ans=0.2 2023-06-28 17:44:47,671 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.729e+02 7.899e+02 1.103e+03 1.681e+03 3.393e+03, threshold=2.206e+03, percent-clipped=5.0 2023-06-28 17:45:06,103 INFO [train.py:996] (0/4) Epoch 12, batch 16350, loss[loss=0.2364, simple_loss=0.3075, pruned_loss=0.0827, over 21415.00 frames. ], tot_loss[loss=0.205, simple_loss=0.2854, pruned_loss=0.0623, over 4261550.99 frames. ], batch size: 211, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 17:45:40,809 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.06 vs. limit=15.0 2023-06-28 17:46:41,540 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2110986.0, ans=0.125 2023-06-28 17:46:53,878 INFO [train.py:996] (0/4) Epoch 12, batch 16400, loss[loss=0.2176, simple_loss=0.2878, pruned_loss=0.07371, over 21458.00 frames. ], tot_loss[loss=0.2077, simple_loss=0.2885, pruned_loss=0.06347, over 4263000.19 frames. 
], batch size: 177, lr: 2.41e-03, grad_scale: 32.0 2023-06-28 17:47:03,231 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2111046.0, ans=0.1 2023-06-28 17:47:37,045 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2111166.0, ans=0.0 2023-06-28 17:48:15,214 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=2111286.0, ans=0.04949747468305833 2023-06-28 17:48:16,147 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.551e+02 7.002e+02 9.291e+02 1.321e+03 2.557e+03, threshold=1.858e+03, percent-clipped=4.0 2023-06-28 17:48:37,400 INFO [train.py:996] (0/4) Epoch 12, batch 16450, loss[loss=0.2159, simple_loss=0.2918, pruned_loss=0.06999, over 21845.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.2882, pruned_loss=0.06459, over 4263691.72 frames. ], batch size: 124, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 17:48:46,306 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2111346.0, ans=0.125 2023-06-28 17:49:02,796 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2111406.0, ans=0.2 2023-06-28 17:49:08,494 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.76 vs. limit=15.0 2023-06-28 17:50:20,642 INFO [train.py:996] (0/4) Epoch 12, batch 16500, loss[loss=0.2138, simple_loss=0.3191, pruned_loss=0.05429, over 20862.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2867, pruned_loss=0.06476, over 4257206.77 frames. ], batch size: 608, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 17:50:22,984 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2111646.0, ans=0.1 2023-06-28 17:50:34,944 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2111646.0, ans=0.125 2023-06-28 17:50:41,353 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2111706.0, ans=0.125 2023-06-28 17:51:20,290 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.94 vs. limit=10.0 2023-06-28 17:51:34,548 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2111826.0, ans=0.125 2023-06-28 17:51:36,000 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2111826.0, ans=0.125 2023-06-28 17:51:45,691 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=2111886.0, ans=0.125 2023-06-28 17:51:52,101 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.183e+02 7.579e+02 1.164e+03 1.772e+03 4.926e+03, threshold=2.328e+03, percent-clipped=21.0 2023-06-28 17:51:54,365 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2111886.0, ans=0.0 2023-06-28 17:52:09,282 INFO [train.py:996] (0/4) Epoch 12, batch 16550, loss[loss=0.2897, simple_loss=0.3615, pruned_loss=0.1089, over 21422.00 frames. 
], tot_loss[loss=0.206, simple_loss=0.2866, pruned_loss=0.06274, over 4261114.71 frames. ], batch size: 507, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 17:52:17,237 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.72 vs. limit=15.0 2023-06-28 17:52:18,761 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.76 vs. limit=15.0 2023-06-28 17:52:22,893 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-352000.pt 2023-06-28 17:52:26,867 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2112006.0, ans=0.0 2023-06-28 17:52:50,928 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.01 vs. limit=10.0 2023-06-28 17:52:51,898 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=2112066.0, ans=10.0 2023-06-28 17:53:22,108 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2112126.0, ans=0.125 2023-06-28 17:53:54,977 INFO [train.py:996] (0/4) Epoch 12, batch 16600, loss[loss=0.27, simple_loss=0.3693, pruned_loss=0.08532, over 21660.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2941, pruned_loss=0.06547, over 4263180.78 frames. ], batch size: 389, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 17:55:01,193 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=2112426.0, ans=0.125 2023-06-28 17:55:27,796 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 6.049e+02 7.685e+02 9.523e+02 1.400e+03 3.440e+03, threshold=1.905e+03, percent-clipped=5.0 2023-06-28 17:55:28,373 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=2112486.0, ans=0.125 2023-06-28 17:55:40,080 INFO [train.py:996] (0/4) Epoch 12, batch 16650, loss[loss=0.2183, simple_loss=0.2939, pruned_loss=0.07139, over 20615.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.3055, pruned_loss=0.069, over 4265081.92 frames. ], batch size: 607, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 17:56:32,370 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.60 vs. limit=12.0 2023-06-28 17:56:42,446 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=2112666.0, ans=0.2 2023-06-28 17:57:35,421 INFO [train.py:996] (0/4) Epoch 12, batch 16700, loss[loss=0.2018, simple_loss=0.2842, pruned_loss=0.05971, over 21770.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.3065, pruned_loss=0.06997, over 4265357.28 frames. ], batch size: 298, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 17:57:46,965 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.67 vs. limit=22.5 2023-06-28 17:58:24,199 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.89 vs. 
limit=22.5 2023-06-28 17:58:30,104 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=2112966.0, ans=0.125 2023-06-28 17:59:08,945 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.823e+02 8.945e+02 1.338e+03 1.942e+03 4.278e+03, threshold=2.675e+03, percent-clipped=28.0 2023-06-28 17:59:26,624 INFO [train.py:996] (0/4) Epoch 12, batch 16750, loss[loss=0.2341, simple_loss=0.3129, pruned_loss=0.07763, over 21576.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3082, pruned_loss=0.07234, over 4268968.36 frames. ], batch size: 230, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 17:59:36,627 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=2113146.0, ans=0.2 2023-06-28 17:59:57,533 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2113206.0, ans=0.125 2023-06-28 18:00:02,692 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2113206.0, ans=0.0 2023-06-28 18:00:38,965 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2113326.0, ans=0.2 2023-06-28 18:00:43,926 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 18:01:00,687 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2113386.0, ans=0.2 2023-06-28 18:01:11,609 INFO [train.py:996] (0/4) Epoch 12, batch 16800, loss[loss=0.2169, simple_loss=0.2979, pruned_loss=0.06795, over 21741.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3121, pruned_loss=0.0726, over 4263111.26 frames. ], batch size: 389, lr: 2.41e-03, grad_scale: 32.0 2023-06-28 18:02:02,435 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.53 vs. limit=15.0 2023-06-28 18:02:28,391 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2113626.0, ans=0.125 2023-06-28 18:02:44,386 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.630e+02 9.200e+02 1.390e+03 2.563e+03 4.897e+03, threshold=2.780e+03, percent-clipped=19.0 2023-06-28 18:02:58,992 INFO [train.py:996] (0/4) Epoch 12, batch 16850, loss[loss=0.1928, simple_loss=0.2674, pruned_loss=0.05906, over 21678.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.308, pruned_loss=0.07216, over 4267730.83 frames. 
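Note on the optim.py:471 lines such as the one above (Clipping_scale=2.0, grad-norm quartiles 5.823e+02 ... 4.278e+03, threshold=2.675e+03, percent-clipped=28.0): the five numbers read as the min/25%/median/75%/max of recent per-batch gradient norms, and in every such entry in this section the threshold equals Clipping_scale times the logged median (2.0 × 1.338e+03 ≈ 2.675e+03 here), with percent-clipped the share of those batches whose norm exceeded it. A sketch of that summary, assuming a simple window of recent norms; the optimizer in this run may track and clip gradients differently:

    import torch

    def grad_norm_report(recent_norms: torch.Tensor, clipping_scale: float = 2.0):
        # recent_norms: per-batch gradient norms from a recent window (assumption).
        q = torch.quantile(recent_norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = clipping_scale * q[2]                        # 2.0 * median, as in the log
        percent_clipped = 100.0 * (recent_norms > threshold).float().mean()
        return q, threshold, percent_clipped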
], batch size: 230, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:03:18,034 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=2113806.0, ans=0.2 2023-06-28 18:03:40,821 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2113866.0, ans=0.2 2023-06-28 18:03:47,130 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2113866.0, ans=0.125 2023-06-28 18:03:49,082 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2113866.0, ans=0.125 2023-06-28 18:03:50,806 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=2113866.0, ans=0.0 2023-06-28 18:04:12,961 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=2113926.0, ans=0.07 2023-06-28 18:04:40,769 INFO [train.py:996] (0/4) Epoch 12, batch 16900, loss[loss=0.2237, simple_loss=0.3027, pruned_loss=0.07238, over 21765.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.3026, pruned_loss=0.07044, over 4277062.80 frames. ], batch size: 441, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:04:51,302 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2114046.0, ans=0.1 2023-06-28 18:04:52,627 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2114046.0, ans=0.0 2023-06-28 18:05:07,505 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=2114106.0, ans=0.125 2023-06-28 18:05:47,660 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2114226.0, ans=0.0 2023-06-28 18:06:08,633 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.231e+02 8.316e+02 1.157e+03 1.734e+03 4.199e+03, threshold=2.313e+03, percent-clipped=8.0 2023-06-28 18:06:14,312 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2114286.0, ans=0.125 2023-06-28 18:06:21,745 INFO [train.py:996] (0/4) Epoch 12, batch 16950, loss[loss=0.1973, simple_loss=0.2697, pruned_loss=0.06244, over 21945.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2949, pruned_loss=0.06839, over 4281379.09 frames. ], batch size: 316, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 18:06:43,982 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=2114406.0, ans=0.95 2023-06-28 18:07:18,996 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2114526.0, ans=0.125 2023-06-28 18:07:25,355 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2114526.0, ans=0.1 2023-06-28 18:07:40,584 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2114586.0, ans=0.125 2023-06-28 18:07:59,332 INFO [train.py:996] (0/4) Epoch 12, batch 17000, loss[loss=0.2161, simple_loss=0.2887, pruned_loss=0.07176, over 21963.00 frames. 
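Note on the recurring scaling.py:182 "ScheduledFloat" lines: each prints a module hyperparameter (a dropout probability, skip rate, balancer bound, ...) whose current value, the "ans=" field, is looked up from the global batch_count whenever the module runs, so the same name can print different values as training progresses. A rough stand-in, assuming a piecewise-linear schedule over batch count; the breakpoints used by the real modules are not visible in the log:

    import bisect

    class ScheduledFloat:
        # A float-valued hyperparameter that depends on the global batch count,
        # interpolated linearly between (batch_count, value) breakpoints.
        # Purely illustrative; not the scaling.py implementation.
        def __init__(self, *points):
            self.xs = [p[0] for p in points]
            self.ys = [p[1] for p in points]

        def value(self, batch_count: float) -> float:
            if batch_count <= self.xs[0]:
                return self.ys[0]
            if batch_count >= self.xs[-1]:
                return self.ys[-1]
            i = bisect.bisect_right(self.xs, batch_count)
            x0, x1 = self.xs[i - 1], self.xs[i]
            y0, y1 = self.ys[i - 1], self.ys[i]
            return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

    # e.g. a dropout rate that anneals from 0.3 to 0.1 over the first 20k batches
    dropout_p = ScheduledFloat((0.0, 0.3), (20000.0, 0.1))
    print(f"ans={dropout_p.value(2113866.0)}")   # far past the last breakpoint -> 0.1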
], tot_loss[loss=0.2143, simple_loss=0.2912, pruned_loss=0.06867, over 4284018.60 frames. ], batch size: 333, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 18:08:23,678 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=2114706.0, ans=0.125 2023-06-28 18:09:29,802 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.045e+02 1.097e+03 1.381e+03 1.822e+03 3.953e+03, threshold=2.762e+03, percent-clipped=12.0 2023-06-28 18:09:42,187 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.88 vs. limit=22.5 2023-06-28 18:09:42,673 INFO [train.py:996] (0/4) Epoch 12, batch 17050, loss[loss=0.2209, simple_loss=0.3092, pruned_loss=0.06628, over 21362.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.298, pruned_loss=0.07052, over 4283164.59 frames. ], batch size: 194, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 18:09:44,927 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2114946.0, ans=0.0 2023-06-28 18:09:52,024 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.70 vs. limit=15.0 2023-06-28 18:10:46,409 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff3.min_abs, batch_count=2115126.0, ans=0.2 2023-06-28 18:10:55,264 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.97 vs. limit=15.0 2023-06-28 18:11:15,795 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=2115186.0, ans=0.0 2023-06-28 18:11:18,379 INFO [train.py:996] (0/4) Epoch 12, batch 17100, loss[loss=0.2025, simple_loss=0.2702, pruned_loss=0.06739, over 21388.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2973, pruned_loss=0.07121, over 4287951.10 frames. ], batch size: 176, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 18:11:30,703 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=2115246.0, ans=0.125 2023-06-28 18:12:40,266 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=2115486.0, ans=0.05 2023-06-28 18:12:52,865 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.623e+02 7.702e+02 1.047e+03 1.626e+03 3.499e+03, threshold=2.095e+03, percent-clipped=2.0 2023-06-28 18:13:01,296 INFO [train.py:996] (0/4) Epoch 12, batch 17150, loss[loss=0.1912, simple_loss=0.2684, pruned_loss=0.05702, over 21472.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2935, pruned_loss=0.0707, over 4288201.88 frames. ], batch size: 131, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 18:13:04,886 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=2115546.0, ans=0.125 2023-06-28 18:13:15,497 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=2115546.0, ans=0.2 2023-06-28 18:13:48,786 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.93 vs. 
limit=15.0 2023-06-28 18:14:44,898 INFO [train.py:996] (0/4) Epoch 12, batch 17200, loss[loss=0.2329, simple_loss=0.3057, pruned_loss=0.07998, over 21396.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2927, pruned_loss=0.06983, over 4292529.35 frames. ], batch size: 548, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:14:49,275 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2115846.0, ans=0.1 2023-06-28 18:14:57,470 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2115846.0, ans=0.0 2023-06-28 18:15:22,224 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2115906.0, ans=0.125 2023-06-28 18:15:28,726 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 18:15:39,312 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.36 vs. limit=22.5 2023-06-28 18:15:50,411 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2116026.0, ans=0.0 2023-06-28 18:16:20,214 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.974e+02 7.324e+02 9.389e+02 1.283e+03 2.769e+03, threshold=1.878e+03, percent-clipped=7.0 2023-06-28 18:16:33,061 INFO [train.py:996] (0/4) Epoch 12, batch 17250, loss[loss=0.2418, simple_loss=0.3174, pruned_loss=0.08315, over 21791.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2966, pruned_loss=0.07218, over 4288474.99 frames. ], batch size: 441, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:16:47,688 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.22 vs. limit=12.0 2023-06-28 18:17:42,865 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=2116326.0, ans=0.0 2023-06-28 18:18:07,611 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2116386.0, ans=0.125 2023-06-28 18:18:15,702 INFO [train.py:996] (0/4) Epoch 12, batch 17300, loss[loss=0.2483, simple_loss=0.3213, pruned_loss=0.08768, over 21835.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.305, pruned_loss=0.07537, over 4290325.22 frames. 
], batch size: 118, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:18:38,473 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2116506.0, ans=0.1 2023-06-28 18:18:50,191 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2116506.0, ans=0.125 2023-06-28 18:19:03,243 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2116566.0, ans=0.125 2023-06-28 18:19:25,669 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 18:19:47,184 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2116686.0, ans=0.1 2023-06-28 18:19:48,135 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.078e+02 8.589e+02 1.215e+03 1.645e+03 3.725e+03, threshold=2.430e+03, percent-clipped=16.0 2023-06-28 18:19:59,796 INFO [train.py:996] (0/4) Epoch 12, batch 17350, loss[loss=0.2406, simple_loss=0.3319, pruned_loss=0.07465, over 21475.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3061, pruned_loss=0.07543, over 4281127.41 frames. ], batch size: 471, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 18:20:15,804 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=2116806.0, ans=0.0 2023-06-28 18:21:09,812 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2116926.0, ans=0.1 2023-06-28 18:21:36,644 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2116986.0, ans=0.125 2023-06-28 18:21:42,610 INFO [train.py:996] (0/4) Epoch 12, batch 17400, loss[loss=0.223, simple_loss=0.3128, pruned_loss=0.06655, over 21641.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.3021, pruned_loss=0.07172, over 4280942.96 frames. ], batch size: 389, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 18:21:43,166 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2117046.0, ans=0.125 2023-06-28 18:22:47,716 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2117226.0, ans=0.125 2023-06-28 18:23:13,922 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.979e+02 8.447e+02 1.378e+03 1.932e+03 4.918e+03, threshold=2.756e+03, percent-clipped=14.0 2023-06-28 18:23:20,612 INFO [train.py:996] (0/4) Epoch 12, batch 17450, loss[loss=0.2069, simple_loss=0.3019, pruned_loss=0.05593, over 21229.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.297, pruned_loss=0.06992, over 4274876.54 frames. ], batch size: 548, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 18:23:44,424 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=2117406.0, ans=0.2 2023-06-28 18:24:36,130 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2117526.0, ans=0.125 2023-06-28 18:24:57,149 INFO [train.py:996] (0/4) Epoch 12, batch 17500, loss[loss=0.1989, simple_loss=0.2714, pruned_loss=0.06319, over 21826.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2921, pruned_loss=0.06692, over 4272794.57 frames. 
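Note on the grad_scale field printed with each batch (32.0 at the top of this stretch, 16.0 and 8.0 here): the run trains in fp16, and these values are consistent with dynamic loss scaling, where the scale is halved after a step that produces inf/nan gradients and grows back after a long run of clean steps. A minimal step using PyTorch's stock GradScaler; the actual scaler wiring in the training script is not shown in the log, and model/batch below are placeholders:

    import torch

    scaler = torch.cuda.amp.GradScaler(init_scale=32.0)   # matches the first grad_scale seen here

    def train_step(model, optimizer, batch):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():                    # fp16 forward pass
            loss = model(batch)                            # placeholder: returns a scalar loss
        scaler.scale(loss).backward()
        scaler.step(optimizer)                             # skipped if inf/nan grads were found
        scaler.update()                                    # halves the scale on overflow, else slowly grows it
        return loss.detach(), scaler.get_scale()           # get_scale(): presumably what grad_scale reports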
], batch size: 247, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 18:26:05,625 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2117826.0, ans=0.125 2023-06-28 18:26:15,583 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=9.79 vs. limit=15.0 2023-06-28 18:26:16,704 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2117826.0, ans=0.125 2023-06-28 18:26:23,333 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2117886.0, ans=0.0 2023-06-28 18:26:30,508 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.515e+02 7.082e+02 9.304e+02 1.343e+03 2.877e+03, threshold=1.861e+03, percent-clipped=1.0 2023-06-28 18:26:36,965 INFO [train.py:996] (0/4) Epoch 12, batch 17550, loss[loss=0.2018, simple_loss=0.2951, pruned_loss=0.05421, over 21417.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2927, pruned_loss=0.06558, over 4271721.00 frames. ], batch size: 131, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 18:26:39,767 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.51 vs. limit=15.0 2023-06-28 18:26:55,299 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=2117946.0, ans=0.0 2023-06-28 18:27:08,785 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.51 vs. limit=15.0 2023-06-28 18:27:13,014 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2118066.0, ans=0.125 2023-06-28 18:27:55,291 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=2118126.0, ans=0.0 2023-06-28 18:28:18,265 INFO [train.py:996] (0/4) Epoch 12, batch 17600, loss[loss=0.2184, simple_loss=0.2967, pruned_loss=0.07006, over 20739.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2958, pruned_loss=0.06535, over 4268114.20 frames. ], batch size: 607, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:28:33,906 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=2118246.0, ans=0.125 2023-06-28 18:28:44,670 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.38 vs. 
limit=15.0 2023-06-28 18:28:50,456 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2118306.0, ans=0.125 2023-06-28 18:29:02,631 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=2118366.0, ans=0.0 2023-06-28 18:29:12,651 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_na.min_abs, batch_count=2118366.0, ans=0.02 2023-06-28 18:29:34,920 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=2118426.0, ans=0.0 2023-06-28 18:29:45,370 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2118486.0, ans=0.125 2023-06-28 18:29:51,278 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.265e+02 8.846e+02 1.006e+03 1.368e+03 3.785e+03, threshold=2.012e+03, percent-clipped=6.0 2023-06-28 18:30:03,167 INFO [train.py:996] (0/4) Epoch 12, batch 17650, loss[loss=0.2212, simple_loss=0.3096, pruned_loss=0.06638, over 20015.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.296, pruned_loss=0.06628, over 4266145.76 frames. ], batch size: 702, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:30:17,712 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=2118546.0, ans=0.125 2023-06-28 18:30:50,355 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=2118666.0, ans=0.025 2023-06-28 18:31:46,613 INFO [train.py:996] (0/4) Epoch 12, batch 17700, loss[loss=0.1633, simple_loss=0.2378, pruned_loss=0.04443, over 21487.00 frames. ], tot_loss[loss=0.208, simple_loss=0.2889, pruned_loss=0.0636, over 4268306.04 frames. ], batch size: 212, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:31:58,966 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=2118846.0, ans=0.125 2023-06-28 18:32:35,936 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=2118966.0, ans=0.125 2023-06-28 18:32:43,244 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2118966.0, ans=0.0 2023-06-28 18:33:19,191 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.445e+02 8.687e+02 1.297e+03 2.273e+03 4.187e+03, threshold=2.595e+03, percent-clipped=29.0 2023-06-28 18:33:26,138 INFO [train.py:996] (0/4) Epoch 12, batch 17750, loss[loss=0.2296, simple_loss=0.3139, pruned_loss=0.07265, over 21721.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2961, pruned_loss=0.06649, over 4266418.09 frames. 
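Note on the scaling.py:962 "Whitening" lines (e.g. feed_forward3.out_whiten just above, metric=13.38 vs. limit=15.0): each compares a measured statistic of one module's activations, meant to track how far their covariance is from white (decorrelated, equal-variance channels), against that module's configured limit. The sketch below is only an illustrative proxy for such a statistic; the actual metric computed in scaling.py is not reproduced here:

    import torch

    def whitening_metric(x: torch.Tensor) -> float:
        # x: (frames, channels).  Returns how unevenly the covariance spreads
        # energy across channel directions: 1.0 if already white, larger if not.
        # Illustrative proxy only; the real measure in scaling.py may differ.
        x = x - x.mean(dim=0, keepdim=True)
        cov = (x.t() @ x) / x.shape[0]
        eigs = torch.linalg.eigvalsh(cov)
        return float(eigs.max() / eigs.mean().clamp(min=1e-20))

    x = torch.randn(1000, 256) * torch.linspace(0.5, 2.0, 256)   # mildly non-white features
    print(f"metric={whitening_metric(x):.2f} vs. limit=15.0")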
], batch size: 298, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:33:28,461 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2119146.0, ans=0.125 2023-06-28 18:33:59,634 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2119206.0, ans=0.125 2023-06-28 18:33:59,649 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2119206.0, ans=0.125 2023-06-28 18:34:23,688 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2119266.0, ans=0.2 2023-06-28 18:34:42,527 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2119326.0, ans=0.125 2023-06-28 18:35:17,608 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2119386.0, ans=0.125 2023-06-28 18:35:20,477 INFO [train.py:996] (0/4) Epoch 12, batch 17800, loss[loss=0.1861, simple_loss=0.2631, pruned_loss=0.05453, over 21570.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2948, pruned_loss=0.06568, over 4269042.78 frames. ], batch size: 112, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:36:09,745 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=2119566.0, ans=0.125 2023-06-28 18:36:52,583 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.029e+02 8.129e+02 1.136e+03 1.993e+03 4.758e+03, threshold=2.272e+03, percent-clipped=17.0 2023-06-28 18:36:59,634 INFO [train.py:996] (0/4) Epoch 12, batch 17850, loss[loss=0.253, simple_loss=0.3197, pruned_loss=0.09319, over 21410.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2963, pruned_loss=0.06653, over 4268494.71 frames. ], batch size: 194, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:37:19,696 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2119806.0, ans=0.0 2023-06-28 18:37:53,266 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2119866.0, ans=0.1 2023-06-28 18:38:40,365 INFO [train.py:996] (0/4) Epoch 12, batch 17900, loss[loss=0.2524, simple_loss=0.3448, pruned_loss=0.07998, over 21831.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.3012, pruned_loss=0.06887, over 4271890.90 frames. ], batch size: 371, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:38:54,750 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2120046.0, ans=0.125 2023-06-28 18:38:54,753 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2120046.0, ans=0.04949747468305833 2023-06-28 18:39:16,436 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=2120106.0, ans=0.0 2023-06-28 18:39:47,100 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.85 vs. limit=15.0 2023-06-28 18:40:08,732 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.79 vs. 
limit=15.0 2023-06-28 18:40:12,405 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.160e+02 9.224e+02 1.391e+03 2.083e+03 4.254e+03, threshold=2.783e+03, percent-clipped=21.0 2023-06-28 18:40:19,126 INFO [train.py:996] (0/4) Epoch 12, batch 17950, loss[loss=0.1921, simple_loss=0.2904, pruned_loss=0.04696, over 21800.00 frames. ], tot_loss[loss=0.216, simple_loss=0.3007, pruned_loss=0.06562, over 4273301.13 frames. ], batch size: 371, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:40:55,927 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 18:41:32,303 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2120526.0, ans=0.0 2023-06-28 18:41:35,466 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2120526.0, ans=0.125 2023-06-28 18:41:53,907 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2120586.0, ans=0.0 2023-06-28 18:41:56,542 INFO [train.py:996] (0/4) Epoch 12, batch 18000, loss[loss=0.1729, simple_loss=0.2473, pruned_loss=0.04919, over 21627.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2939, pruned_loss=0.0642, over 4272249.43 frames. ], batch size: 332, lr: 2.41e-03, grad_scale: 32.0 2023-06-28 18:41:56,543 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-28 18:42:16,415 INFO [train.py:1028] (0/4) Epoch 12, validation: loss=0.2604, simple_loss=0.3527, pruned_loss=0.08401, over 1796401.00 frames. 2023-06-28 18:42:16,416 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23714MB 2023-06-28 18:42:23,020 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.93 vs. limit=6.0 2023-06-28 18:42:34,218 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2120706.0, ans=0.1 2023-06-28 18:42:45,408 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=2120706.0, ans=0.0 2023-06-28 18:42:54,157 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2120706.0, ans=0.125 2023-06-28 18:43:14,546 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.53 vs. limit=6.0 2023-06-28 18:43:20,978 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2120826.0, ans=0.1 2023-06-28 18:43:27,433 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=2120826.0, ans=0.125 2023-06-28 18:43:51,612 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.00 vs. 
limit=15.0 2023-06-28 18:43:55,005 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.231e+02 7.241e+02 9.176e+02 1.211e+03 3.223e+03, threshold=1.835e+03, percent-clipped=1.0 2023-06-28 18:43:59,012 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=2120946.0, ans=0.05 2023-06-28 18:44:00,007 INFO [train.py:996] (0/4) Epoch 12, batch 18050, loss[loss=0.2162, simple_loss=0.3014, pruned_loss=0.06554, over 21406.00 frames. ], tot_loss[loss=0.2074, simple_loss=0.2881, pruned_loss=0.06332, over 4271519.81 frames. ], batch size: 131, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:45:11,048 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.71 vs. limit=15.0 2023-06-28 18:45:17,390 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.25 vs. limit=15.0 2023-06-28 18:45:33,389 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=2121186.0, ans=0.2 2023-06-28 18:45:33,479 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2121186.0, ans=0.125 2023-06-28 18:45:44,359 INFO [train.py:996] (0/4) Epoch 12, batch 18100, loss[loss=0.2406, simple_loss=0.3349, pruned_loss=0.07313, over 21634.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2919, pruned_loss=0.06527, over 4261394.81 frames. ], batch size: 414, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:45:46,611 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2121246.0, ans=0.1 2023-06-28 18:46:21,580 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=2121306.0, ans=0.09899494936611666 2023-06-28 18:46:53,655 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.24 vs. limit=22.5 2023-06-28 18:47:00,787 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2121426.0, ans=0.0 2023-06-28 18:47:23,005 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.434e+02 8.761e+02 1.193e+03 1.712e+03 3.705e+03, threshold=2.386e+03, percent-clipped=21.0 2023-06-28 18:47:26,576 INFO [train.py:996] (0/4) Epoch 12, batch 18150, loss[loss=0.1984, simple_loss=0.276, pruned_loss=0.06037, over 21460.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2942, pruned_loss=0.06565, over 4265185.98 frames. 
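Note on the validation entries earlier in this stretch ("Computing validation loss", "Epoch 12, validation: loss=0.2604, ..., over 1796401.00 frames.", "Maximum memory allocated so far is 23714MB"): at batch 18000 the trainer pauses to run a full pass over the dev cuts with gradients disabled, reports a frames-weighted validation loss, then resumes training; the memory figure is presumably the peak CUDA allocation on this GPU since the process started. A compact sketch of that pattern; the dataloader and the model's loss call are placeholders:

    import torch

    def validate(model, valid_loader, device="cuda:0"):
        # One pass over the dev set with no gradients, mirroring the
        # "Computing validation loss" entries; model(batch) is a placeholder
        # assumed to return (summed loss, number of frames) for the batch.
        model.eval()
        tot_loss, tot_frames = 0.0, 0.0
        with torch.no_grad():
            for batch in valid_loader:
                loss_sum, num_frames = model(batch)
                tot_loss += float(loss_sum)
                tot_frames += float(num_frames)
        model.train()
        peak_mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
        print(f"validation: loss={tot_loss / tot_frames:.4f}, over {tot_frames:.2f} frames.")
        print(f"Maximum memory allocated so far is {peak_mb}MB")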
], batch size: 195, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 18:48:03,784 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2121606.0, ans=0.125 2023-06-28 18:48:05,538 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2121606.0, ans=0.1 2023-06-28 18:48:15,521 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2121666.0, ans=0.2 2023-06-28 18:48:29,051 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2121666.0, ans=0.125 2023-06-28 18:48:45,517 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.95 vs. limit=15.0 2023-06-28 18:48:59,985 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2121786.0, ans=0.1 2023-06-28 18:49:08,823 INFO [train.py:996] (0/4) Epoch 12, batch 18200, loss[loss=0.2006, simple_loss=0.2638, pruned_loss=0.06869, over 21392.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2888, pruned_loss=0.06565, over 4274004.27 frames. ], batch size: 144, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 18:49:55,277 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2121966.0, ans=0.0 2023-06-28 18:49:59,995 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2121966.0, ans=0.125 2023-06-28 18:50:18,721 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.22 vs. limit=15.0 2023-06-28 18:50:21,287 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2122026.0, ans=0.125 2023-06-28 18:50:25,746 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=2122086.0, ans=0.2 2023-06-28 18:50:44,438 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.871e+02 6.470e+02 8.191e+02 1.481e+03 3.644e+03, threshold=1.638e+03, percent-clipped=8.0 2023-06-28 18:50:48,141 INFO [train.py:996] (0/4) Epoch 12, batch 18250, loss[loss=0.1828, simple_loss=0.2523, pruned_loss=0.05661, over 21730.00 frames. ], tot_loss[loss=0.2041, simple_loss=0.2815, pruned_loss=0.06336, over 4270824.19 frames. ], batch size: 247, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 18:50:49,484 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.17 vs. limit=6.0 2023-06-28 18:50:58,389 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=2122146.0, ans=0.2 2023-06-28 18:52:25,406 INFO [train.py:996] (0/4) Epoch 12, batch 18300, loss[loss=0.233, simple_loss=0.3025, pruned_loss=0.08175, over 21758.00 frames. ], tot_loss[loss=0.2043, simple_loss=0.2821, pruned_loss=0.06326, over 4259149.26 frames. 
], batch size: 112, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 18:52:34,051 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2122446.0, ans=0.125 2023-06-28 18:53:29,879 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2122626.0, ans=0.125 2023-06-28 18:53:33,821 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.01 vs. limit=15.0 2023-06-28 18:53:47,780 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2122686.0, ans=0.125 2023-06-28 18:53:49,317 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=2122686.0, ans=0.125 2023-06-28 18:54:03,762 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.548e+02 1.033e+03 1.487e+03 2.193e+03 4.357e+03, threshold=2.975e+03, percent-clipped=43.0 2023-06-28 18:54:06,763 INFO [train.py:996] (0/4) Epoch 12, batch 18350, loss[loss=0.19, simple_loss=0.2551, pruned_loss=0.06243, over 21164.00 frames. ], tot_loss[loss=0.208, simple_loss=0.2885, pruned_loss=0.06375, over 4258138.21 frames. ], batch size: 176, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 18:54:27,775 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2122806.0, ans=0.1 2023-06-28 18:54:31,164 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=2122806.0, ans=0.125 2023-06-28 18:55:12,461 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2122926.0, ans=0.125 2023-06-28 18:55:31,123 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.07 vs. limit=22.5 2023-06-28 18:55:49,923 INFO [train.py:996] (0/4) Epoch 12, batch 18400, loss[loss=0.2063, simple_loss=0.2788, pruned_loss=0.06693, over 21298.00 frames. ], tot_loss[loss=0.2029, simple_loss=0.2829, pruned_loss=0.06144, over 4257956.61 frames. ], batch size: 143, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:55:52,160 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2123046.0, ans=0.125 2023-06-28 18:56:20,646 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=2123106.0, ans=0.2 2023-06-28 18:56:21,142 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.49 vs. 
limit=22.5 2023-06-28 18:56:26,790 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2123106.0, ans=0.0 2023-06-28 18:56:36,736 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=2123166.0, ans=0.2 2023-06-28 18:56:41,457 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=2123166.0, ans=0.05 2023-06-28 18:57:21,759 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=2123286.0, ans=0.2 2023-06-28 18:57:22,746 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.644e+02 6.567e+02 9.671e+02 1.816e+03 3.680e+03, threshold=1.934e+03, percent-clipped=2.0 2023-06-28 18:57:26,092 INFO [train.py:996] (0/4) Epoch 12, batch 18450, loss[loss=0.1695, simple_loss=0.2506, pruned_loss=0.04416, over 21689.00 frames. ], tot_loss[loss=0.1982, simple_loss=0.2795, pruned_loss=0.05842, over 4250010.44 frames. ], batch size: 282, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:59:07,201 INFO [train.py:996] (0/4) Epoch 12, batch 18500, loss[loss=0.1732, simple_loss=0.2446, pruned_loss=0.05084, over 21825.00 frames. ], tot_loss[loss=0.1952, simple_loss=0.2755, pruned_loss=0.05743, over 4244550.10 frames. ], batch size: 352, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:59:37,543 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.91 vs. limit=15.0 2023-06-28 19:00:45,281 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.695e+02 8.087e+02 1.310e+03 2.007e+03 4.820e+03, threshold=2.620e+03, percent-clipped=25.0 2023-06-28 19:00:48,719 INFO [train.py:996] (0/4) Epoch 12, batch 18550, loss[loss=0.1976, simple_loss=0.2714, pruned_loss=0.0619, over 21782.00 frames. ], tot_loss[loss=0.193, simple_loss=0.2726, pruned_loss=0.05673, over 4237424.34 frames. ], batch size: 351, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 19:00:54,569 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2123946.0, ans=0.125 2023-06-28 19:01:19,270 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=2124006.0, ans=0.05 2023-06-28 19:01:30,542 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=2124006.0, ans=0.0 2023-06-28 19:01:54,597 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=2124126.0, ans=0.035 2023-06-28 19:02:11,249 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2124126.0, ans=0.1 2023-06-28 19:02:23,173 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=2124186.0, ans=0.2 2023-06-28 19:02:32,399 INFO [train.py:996] (0/4) Epoch 12, batch 18600, loss[loss=0.2343, simple_loss=0.3209, pruned_loss=0.0738, over 21699.00 frames. ], tot_loss[loss=0.1948, simple_loss=0.2723, pruned_loss=0.05868, over 4220226.83 frames. 
], batch size: 415, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 19:03:04,172 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=2124306.0, ans=0.2 2023-06-28 19:03:10,637 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=2124306.0, ans=0.2 2023-06-28 19:03:37,801 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.06 vs. limit=15.0 2023-06-28 19:03:50,703 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.59 vs. limit=5.0 2023-06-28 19:04:12,043 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.057e+02 7.816e+02 1.103e+03 1.650e+03 3.069e+03, threshold=2.205e+03, percent-clipped=1.0 2023-06-28 19:04:13,776 INFO [train.py:996] (0/4) Epoch 12, batch 18650, loss[loss=0.1902, simple_loss=0.2598, pruned_loss=0.06026, over 21679.00 frames. ], tot_loss[loss=0.1939, simple_loss=0.2711, pruned_loss=0.05837, over 4230587.18 frames. ], batch size: 333, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 19:05:11,527 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=2124666.0, ans=10.0 2023-06-28 19:05:12,087 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.94 vs. limit=22.5 2023-06-28 19:05:35,712 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=2124726.0, ans=0.2 2023-06-28 19:05:48,999 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=2124786.0, ans=0.0 2023-06-28 19:05:55,227 INFO [train.py:996] (0/4) Epoch 12, batch 18700, loss[loss=0.1727, simple_loss=0.2634, pruned_loss=0.04103, over 21604.00 frames. ], tot_loss[loss=0.194, simple_loss=0.2689, pruned_loss=0.05956, over 4219715.70 frames. ], batch size: 230, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 19:06:02,768 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.90 vs. limit=15.0 2023-06-28 19:06:59,934 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2125026.0, ans=0.125 2023-06-28 19:07:16,589 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=2125026.0, ans=0.05 2023-06-28 19:07:35,705 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.958e+02 6.838e+02 8.648e+02 1.289e+03 2.694e+03, threshold=1.730e+03, percent-clipped=5.0 2023-06-28 19:07:37,330 INFO [train.py:996] (0/4) Epoch 12, batch 18750, loss[loss=0.1947, simple_loss=0.2603, pruned_loss=0.06455, over 21263.00 frames. ], tot_loss[loss=0.1985, simple_loss=0.2723, pruned_loss=0.06234, over 4242595.17 frames. ], batch size: 144, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 19:08:28,430 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2125266.0, ans=0.1 2023-06-28 19:09:19,270 INFO [train.py:996] (0/4) Epoch 12, batch 18800, loss[loss=0.2053, simple_loss=0.3128, pruned_loss=0.04886, over 20781.00 frames. 
], tot_loss[loss=0.2028, simple_loss=0.2787, pruned_loss=0.06346, over 4240505.18 frames. ], batch size: 607, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 19:10:22,447 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=2125626.0, ans=0.07 2023-06-28 19:10:49,845 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2125686.0, ans=0.1 2023-06-28 19:10:56,979 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.67 vs. limit=15.0 2023-06-28 19:10:58,898 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.378e+02 7.621e+02 1.255e+03 1.956e+03 3.877e+03, threshold=2.510e+03, percent-clipped=29.0 2023-06-28 19:11:00,574 INFO [train.py:996] (0/4) Epoch 12, batch 18850, loss[loss=0.176, simple_loss=0.2591, pruned_loss=0.04639, over 21505.00 frames. ], tot_loss[loss=0.1975, simple_loss=0.2758, pruned_loss=0.05956, over 4241837.08 frames. ], batch size: 194, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 19:11:28,001 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=2125806.0, ans=0.0 2023-06-28 19:12:17,379 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.35 vs. limit=10.0 2023-06-28 19:12:18,190 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=2125926.0, ans=0.07 2023-06-28 19:12:40,422 INFO [train.py:996] (0/4) Epoch 12, batch 18900, loss[loss=0.251, simple_loss=0.2907, pruned_loss=0.1057, over 21439.00 frames. ], tot_loss[loss=0.1955, simple_loss=0.2721, pruned_loss=0.05948, over 4250168.61 frames. ], batch size: 508, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 19:13:47,572 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=2126226.0, ans=0.125 2023-06-28 19:13:55,841 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=2126226.0, ans=0.125 2023-06-28 19:14:02,575 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=2126286.0, ans=0.2 2023-06-28 19:14:14,964 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.285e+02 7.566e+02 1.259e+03 1.840e+03 2.966e+03, threshold=2.518e+03, percent-clipped=3.0 2023-06-28 19:14:16,558 INFO [train.py:996] (0/4) Epoch 12, batch 18950, loss[loss=0.2148, simple_loss=0.2943, pruned_loss=0.06768, over 21765.00 frames. ], tot_loss[loss=0.1971, simple_loss=0.2715, pruned_loss=0.06136, over 4258348.43 frames. ], batch size: 247, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 19:14:27,826 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.28 vs. limit=15.0 2023-06-28 19:14:36,210 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.59 vs. 
limit=15.0 2023-06-28 19:14:47,521 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2126406.0, ans=0.1 2023-06-28 19:15:11,716 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.85 vs. limit=15.0 2023-06-28 19:15:30,773 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2126526.0, ans=0.1 2023-06-28 19:15:43,928 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=2126586.0, ans=0.04949747468305833 2023-06-28 19:15:47,383 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=2126586.0, ans=0.0 2023-06-28 19:15:52,243 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=2126586.0, ans=0.5 2023-06-28 19:15:55,110 INFO [train.py:996] (0/4) Epoch 12, batch 19000, loss[loss=0.2419, simple_loss=0.3328, pruned_loss=0.0755, over 21755.00 frames. ], tot_loss[loss=0.2047, simple_loss=0.2832, pruned_loss=0.06316, over 4270978.01 frames. ], batch size: 124, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 19:16:00,545 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2126646.0, ans=0.125 2023-06-28 19:16:19,585 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=2126706.0, ans=0.0 2023-06-28 19:16:27,851 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2126706.0, ans=0.125 2023-06-28 19:16:40,906 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2126766.0, ans=0.125 2023-06-28 19:16:47,481 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=2126766.0, ans=0.125 2023-06-28 19:17:32,189 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.360e+02 7.287e+02 9.721e+02 1.319e+03 3.703e+03, threshold=1.944e+03, percent-clipped=9.0 2023-06-28 19:17:33,787 INFO [train.py:996] (0/4) Epoch 12, batch 19050, loss[loss=0.2062, simple_loss=0.2768, pruned_loss=0.06785, over 21816.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2876, pruned_loss=0.06545, over 4275551.35 frames. ], batch size: 282, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 19:18:17,071 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2127006.0, ans=0.0 2023-06-28 19:18:51,098 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.18 vs. limit=10.0 2023-06-28 19:19:16,218 INFO [train.py:996] (0/4) Epoch 12, batch 19100, loss[loss=0.1987, simple_loss=0.2714, pruned_loss=0.06296, over 21757.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2861, pruned_loss=0.06666, over 4277729.48 frames. 
], batch size: 112, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 19:19:35,907 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2127246.0, ans=0.1 2023-06-28 19:19:54,163 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=2127306.0, ans=0.0 2023-06-28 19:20:22,978 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.57 vs. limit=15.0 2023-06-28 19:20:39,570 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=2127486.0, ans=0.025 2023-06-28 19:20:52,001 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.94 vs. limit=15.0 2023-06-28 19:20:53,029 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2127486.0, ans=0.0 2023-06-28 19:20:58,311 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=2127486.0, ans=0.0 2023-06-28 19:21:01,438 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.479e+02 7.973e+02 1.169e+03 1.755e+03 3.524e+03, threshold=2.338e+03, percent-clipped=19.0 2023-06-28 19:21:03,171 INFO [train.py:996] (0/4) Epoch 12, batch 19150, loss[loss=0.2245, simple_loss=0.3147, pruned_loss=0.06711, over 21429.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2877, pruned_loss=0.06702, over 4274727.29 frames. ], batch size: 211, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 19:21:21,980 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 19:22:28,030 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2127786.0, ans=0.1 2023-06-28 19:22:36,692 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=2127786.0, ans=0.2 2023-06-28 19:22:38,933 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.25 vs. limit=15.0 2023-06-28 19:22:51,257 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2127786.0, ans=0.1 2023-06-28 19:22:51,363 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2127786.0, ans=0.0 2023-06-28 19:22:53,938 INFO [train.py:996] (0/4) Epoch 12, batch 19200, loss[loss=0.2307, simple_loss=0.3333, pruned_loss=0.06406, over 21711.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2975, pruned_loss=0.06791, over 4270816.74 frames. ], batch size: 298, lr: 2.41e-03, grad_scale: 32.0 2023-06-28 19:23:26,871 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.58 vs. limit=22.5 2023-06-28 19:23:41,191 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=2127966.0, ans=0.2 2023-06-28 19:23:41,761 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.27 vs. 
limit=15.0 2023-06-28 19:24:02,658 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2128026.0, ans=0.1 2023-06-28 19:24:07,580 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2128026.0, ans=0.0 2023-06-28 19:24:36,177 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.048e+02 8.519e+02 1.165e+03 1.659e+03 4.865e+03, threshold=2.330e+03, percent-clipped=13.0 2023-06-28 19:24:36,207 INFO [train.py:996] (0/4) Epoch 12, batch 19250, loss[loss=0.1773, simple_loss=0.27, pruned_loss=0.0423, over 21784.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2987, pruned_loss=0.06418, over 4276162.90 frames. ], batch size: 282, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 19:26:18,596 INFO [train.py:996] (0/4) Epoch 12, batch 19300, loss[loss=0.1646, simple_loss=0.2564, pruned_loss=0.03644, over 21767.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2952, pruned_loss=0.06326, over 4281754.85 frames. ], batch size: 282, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:26:55,059 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.88 vs. limit=22.5 2023-06-28 19:27:05,208 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.97 vs. limit=15.0 2023-06-28 19:27:39,988 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=2128686.0, ans=0.5 2023-06-28 19:27:57,281 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.967e+02 7.718e+02 1.195e+03 1.796e+03 4.248e+03, threshold=2.390e+03, percent-clipped=9.0 2023-06-28 19:27:57,312 INFO [train.py:996] (0/4) Epoch 12, batch 19350, loss[loss=0.1876, simple_loss=0.278, pruned_loss=0.04861, over 21750.00 frames. ], tot_loss[loss=0.2052, simple_loss=0.29, pruned_loss=0.06023, over 4286601.26 frames. ], batch size: 316, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:28:15,556 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=2128746.0, ans=0.125 2023-06-28 19:28:17,472 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2128806.0, ans=0.04949747468305833 2023-06-28 19:28:39,019 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2128866.0, ans=0.125 2023-06-28 19:28:55,046 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2128926.0, ans=0.1 2023-06-28 19:29:03,034 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2128926.0, ans=0.125 2023-06-28 19:29:17,823 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2128986.0, ans=0.125 2023-06-28 19:29:36,914 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=2129046.0, ans=0.125 2023-06-28 19:29:37,799 INFO [train.py:996] (0/4) Epoch 12, batch 19400, loss[loss=0.2204, simple_loss=0.2999, pruned_loss=0.0704, over 21823.00 frames. ], tot_loss[loss=0.2033, simple_loss=0.2872, pruned_loss=0.05972, over 4288433.95 frames. 
], batch size: 391, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:30:32,851 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2129166.0, ans=0.125 2023-06-28 19:30:47,920 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2129226.0, ans=0.125 2023-06-28 19:31:17,178 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=2129286.0, ans=0.0 2023-06-28 19:31:19,983 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.788e+02 6.972e+02 8.917e+02 1.265e+03 3.232e+03, threshold=1.783e+03, percent-clipped=5.0 2023-06-28 19:31:20,018 INFO [train.py:996] (0/4) Epoch 12, batch 19450, loss[loss=0.241, simple_loss=0.2814, pruned_loss=0.1003, over 21554.00 frames. ], tot_loss[loss=0.2034, simple_loss=0.2842, pruned_loss=0.06126, over 4289465.66 frames. ], batch size: 511, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:31:33,493 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=2129346.0, ans=0.125 2023-06-28 19:31:45,362 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2129406.0, ans=0.125 2023-06-28 19:31:50,369 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2129406.0, ans=0.125 2023-06-28 19:32:26,212 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2129526.0, ans=0.1 2023-06-28 19:33:02,592 INFO [train.py:996] (0/4) Epoch 12, batch 19500, loss[loss=0.1777, simple_loss=0.2356, pruned_loss=0.05993, over 21932.00 frames. ], tot_loss[loss=0.2012, simple_loss=0.2786, pruned_loss=0.06184, over 4282768.61 frames. ], batch size: 103, lr: 2.40e-03, grad_scale: 8.0 2023-06-28 19:33:54,485 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2129766.0, ans=0.125 2023-06-28 19:34:07,628 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2129826.0, ans=0.0 2023-06-28 19:34:14,078 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=2129826.0, ans=0.125 2023-06-28 19:34:25,227 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=2129886.0, ans=0.125 2023-06-28 19:34:43,706 INFO [train.py:996] (0/4) Epoch 12, batch 19550, loss[loss=0.2146, simple_loss=0.3107, pruned_loss=0.05922, over 21746.00 frames. ], tot_loss[loss=0.1988, simple_loss=0.2758, pruned_loss=0.06094, over 4278191.42 frames. ], batch size: 414, lr: 2.40e-03, grad_scale: 8.0 2023-06-28 19:34:45,239 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.793e+02 7.099e+02 1.131e+03 1.724e+03 3.417e+03, threshold=2.262e+03, percent-clipped=22.0 2023-06-28 19:34:45,999 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2129946.0, ans=0.0 2023-06-28 19:35:39,165 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.59 vs. 
limit=15.0 2023-06-28 19:35:48,610 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2130126.0, ans=0.125 2023-06-28 19:36:25,911 INFO [train.py:996] (0/4) Epoch 12, batch 19600, loss[loss=0.1874, simple_loss=0.2746, pruned_loss=0.05009, over 21659.00 frames. ], tot_loss[loss=0.1998, simple_loss=0.2773, pruned_loss=0.06115, over 4281392.67 frames. ], batch size: 263, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:36:33,304 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2130246.0, ans=0.125 2023-06-28 19:37:40,725 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=2130426.0, ans=0.2 2023-06-28 19:38:05,793 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.29 vs. limit=15.0 2023-06-28 19:38:14,412 INFO [train.py:996] (0/4) Epoch 12, batch 19650, loss[loss=0.2206, simple_loss=0.3029, pruned_loss=0.06913, over 20090.00 frames. ], tot_loss[loss=0.2054, simple_loss=0.2818, pruned_loss=0.06448, over 4283556.62 frames. ], batch size: 704, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:38:16,158 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.322e+02 7.698e+02 1.187e+03 1.875e+03 3.672e+03, threshold=2.374e+03, percent-clipped=11.0 2023-06-28 19:38:20,190 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=2130546.0, ans=0.125 2023-06-28 19:38:52,603 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2130666.0, ans=0.125 2023-06-28 19:39:39,177 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.27 vs. limit=6.0 2023-06-28 19:40:00,294 INFO [train.py:996] (0/4) Epoch 12, batch 19700, loss[loss=0.2381, simple_loss=0.3284, pruned_loss=0.07394, over 21516.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.2854, pruned_loss=0.06526, over 4289767.34 frames. ], batch size: 471, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:40:18,443 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=2130846.0, ans=0.125 2023-06-28 19:41:38,250 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.54 vs. limit=10.0 2023-06-28 19:41:50,333 INFO [train.py:996] (0/4) Epoch 12, batch 19750, loss[loss=0.2668, simple_loss=0.3464, pruned_loss=0.09364, over 21735.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2937, pruned_loss=0.06657, over 4281160.10 frames. ], batch size: 441, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:41:51,906 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.980e+02 8.894e+02 1.243e+03 1.861e+03 5.840e+03, threshold=2.486e+03, percent-clipped=14.0 2023-06-28 19:42:14,751 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.64 vs. 
limit=15.0 2023-06-28 19:42:54,710 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2131326.0, ans=0.125 2023-06-28 19:42:57,932 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2131326.0, ans=0.0 2023-06-28 19:43:09,380 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=2131386.0, ans=0.125 2023-06-28 19:43:10,821 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=2131386.0, ans=10.0 2023-06-28 19:43:31,913 INFO [train.py:996] (0/4) Epoch 12, batch 19800, loss[loss=0.1678, simple_loss=0.2467, pruned_loss=0.04446, over 21733.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2949, pruned_loss=0.06763, over 4285767.86 frames. ], batch size: 247, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:43:32,498 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=2131446.0, ans=0.0 2023-06-28 19:44:58,819 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2131686.0, ans=0.1 2023-06-28 19:45:16,348 INFO [train.py:996] (0/4) Epoch 12, batch 19850, loss[loss=0.1542, simple_loss=0.2314, pruned_loss=0.03846, over 21382.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2879, pruned_loss=0.06362, over 4284278.96 frames. ], batch size: 131, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:45:18,109 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.069e+02 7.581e+02 9.843e+02 1.508e+03 3.551e+03, threshold=1.969e+03, percent-clipped=6.0 2023-06-28 19:45:37,458 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=2131806.0, ans=0.125 2023-06-28 19:45:42,606 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2131806.0, ans=0.1 2023-06-28 19:46:51,676 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=2131986.0, ans=0.2 2023-06-28 19:46:53,141 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2131986.0, ans=0.125 2023-06-28 19:46:59,306 INFO [train.py:996] (0/4) Epoch 12, batch 19900, loss[loss=0.1959, simple_loss=0.3011, pruned_loss=0.04532, over 21303.00 frames. ], tot_loss[loss=0.205, simple_loss=0.2881, pruned_loss=0.06094, over 4272144.79 frames. ], batch size: 548, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:48:03,043 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2132226.0, ans=0.1 2023-06-28 19:48:42,883 INFO [train.py:996] (0/4) Epoch 12, batch 19950, loss[loss=0.1757, simple_loss=0.2446, pruned_loss=0.05342, over 21192.00 frames. ], tot_loss[loss=0.201, simple_loss=0.2822, pruned_loss=0.0599, over 4260526.44 frames. 
], batch size: 143, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:48:44,577 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.913e+02 9.095e+02 1.320e+03 1.827e+03 2.856e+03, threshold=2.640e+03, percent-clipped=20.0 2023-06-28 19:48:47,444 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.54 vs. limit=22.5 2023-06-28 19:49:01,146 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.31 vs. limit=15.0 2023-06-28 19:49:04,097 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=2132406.0, ans=0.025 2023-06-28 19:49:06,058 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.95 vs. limit=6.0 2023-06-28 19:50:19,983 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2132586.0, ans=0.0 2023-06-28 19:50:25,842 INFO [train.py:996] (0/4) Epoch 12, batch 20000, loss[loss=0.2168, simple_loss=0.3095, pruned_loss=0.06205, over 21713.00 frames. ], tot_loss[loss=0.2031, simple_loss=0.284, pruned_loss=0.0611, over 4268405.58 frames. ], batch size: 351, lr: 2.40e-03, grad_scale: 32.0 2023-06-28 19:52:06,802 INFO [train.py:996] (0/4) Epoch 12, batch 20050, loss[loss=0.2108, simple_loss=0.2787, pruned_loss=0.07145, over 21544.00 frames. ], tot_loss[loss=0.2063, simple_loss=0.2862, pruned_loss=0.0632, over 4266415.35 frames. ], batch size: 548, lr: 2.40e-03, grad_scale: 32.0 2023-06-28 19:52:08,364 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.463e+02 7.625e+02 1.079e+03 1.735e+03 4.168e+03, threshold=2.158e+03, percent-clipped=5.0 2023-06-28 19:52:13,691 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=2132946.0, ans=0.0 2023-06-28 19:52:23,437 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2132946.0, ans=0.1 2023-06-28 19:53:17,113 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.10 vs. limit=6.0 2023-06-28 19:53:40,134 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 19:53:44,638 INFO [train.py:996] (0/4) Epoch 12, batch 20100, loss[loss=0.211, simple_loss=0.2876, pruned_loss=0.06723, over 21451.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2887, pruned_loss=0.06523, over 4273883.99 frames. ], batch size: 211, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:53:50,798 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.62 vs. 
limit=15.0 2023-06-28 19:54:33,265 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2133366.0, ans=0.125 2023-06-28 19:55:10,525 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 19:55:15,558 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2133486.0, ans=0.125 2023-06-28 19:55:38,340 INFO [train.py:996] (0/4) Epoch 12, batch 20150, loss[loss=0.2301, simple_loss=0.3099, pruned_loss=0.07516, over 21763.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2972, pruned_loss=0.06843, over 4277616.37 frames. ], batch size: 298, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:55:41,577 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.747e+02 8.369e+02 1.261e+03 1.979e+03 4.381e+03, threshold=2.521e+03, percent-clipped=21.0 2023-06-28 19:55:49,189 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2133546.0, ans=0.0 2023-06-28 19:56:25,314 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2133666.0, ans=0.125 2023-06-28 19:56:46,050 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=9.64 vs. limit=15.0 2023-06-28 19:57:11,792 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=2133786.0, ans=0.0 2023-06-28 19:57:24,631 INFO [train.py:996] (0/4) Epoch 12, batch 20200, loss[loss=0.2161, simple_loss=0.3018, pruned_loss=0.06521, over 21682.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.3018, pruned_loss=0.07022, over 4272239.79 frames. ], batch size: 298, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:57:51,887 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2133906.0, ans=0.125 2023-06-28 19:58:01,542 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2133906.0, ans=0.125 2023-06-28 19:58:59,389 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=2134086.0, ans=10.0 2023-06-28 19:59:01,023 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=2134086.0, ans=0.04949747468305833 2023-06-28 19:59:11,833 INFO [train.py:996] (0/4) Epoch 12, batch 20250, loss[loss=0.1981, simple_loss=0.2871, pruned_loss=0.05457, over 21788.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.3026, pruned_loss=0.06882, over 4274354.72 frames. 
], batch size: 282, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:59:19,724 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.497e+02 8.759e+02 1.398e+03 2.270e+03 4.094e+03, threshold=2.796e+03, percent-clipped=18.0 2023-06-28 19:59:25,441 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=2134146.0, ans=0.95 2023-06-28 20:00:10,908 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2134266.0, ans=0.125 2023-06-28 20:00:14,742 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.02 vs. limit=22.5 2023-06-28 20:00:53,963 INFO [train.py:996] (0/4) Epoch 12, batch 20300, loss[loss=0.166, simple_loss=0.2443, pruned_loss=0.04386, over 21786.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.3015, pruned_loss=0.0668, over 4268616.58 frames. ], batch size: 124, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:01:11,032 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=2134446.0, ans=0.07 2023-06-28 20:01:43,040 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.90 vs. limit=15.0 2023-06-28 20:02:06,691 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2134626.0, ans=0.125 2023-06-28 20:02:33,719 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.40 vs. limit=15.0 2023-06-28 20:02:34,212 INFO [train.py:996] (0/4) Epoch 12, batch 20350, loss[loss=0.2293, simple_loss=0.3024, pruned_loss=0.07816, over 21797.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.3006, pruned_loss=0.0661, over 4262762.21 frames. ], batch size: 351, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:02:37,269 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.278e+02 8.027e+02 1.220e+03 1.701e+03 2.990e+03, threshold=2.441e+03, percent-clipped=1.0 2023-06-28 20:03:44,614 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=2134926.0, ans=0.0 2023-06-28 20:04:21,621 INFO [train.py:996] (0/4) Epoch 12, batch 20400, loss[loss=0.2451, simple_loss=0.3266, pruned_loss=0.08183, over 21913.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.3031, pruned_loss=0.06854, over 4266112.26 frames. 
], batch size: 107, lr: 2.40e-03, grad_scale: 32.0 2023-06-28 20:04:25,862 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2135046.0, ans=0.1 2023-06-28 20:04:48,671 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=2135106.0, ans=0.2 2023-06-28 20:04:50,438 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=2135106.0, ans=0.2 2023-06-28 20:04:55,529 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 20:04:56,961 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2135166.0, ans=0.1 2023-06-28 20:05:18,438 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=2135166.0, ans=0.0 2023-06-28 20:05:18,459 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2135166.0, ans=0.125 2023-06-28 20:05:39,339 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2135286.0, ans=0.125 2023-06-28 20:05:58,028 INFO [train.py:996] (0/4) Epoch 12, batch 20450, loss[loss=0.2658, simple_loss=0.322, pruned_loss=0.1048, over 21826.00 frames. ], tot_loss[loss=0.223, simple_loss=0.3036, pruned_loss=0.07117, over 4263211.05 frames. ], batch size: 441, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:05:58,754 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 20:06:03,015 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.519e+02 7.818e+02 1.125e+03 1.970e+03 4.809e+03, threshold=2.251e+03, percent-clipped=13.0 2023-06-28 20:06:17,948 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2135406.0, ans=0.1 2023-06-28 20:06:21,514 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2135406.0, ans=0.0 2023-06-28 20:06:30,069 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2135406.0, ans=0.0 2023-06-28 20:06:30,608 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.77 vs. limit=22.5 2023-06-28 20:07:01,576 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.37 vs. limit=15.0 2023-06-28 20:07:39,479 INFO [train.py:996] (0/4) Epoch 12, batch 20500, loss[loss=0.1979, simple_loss=0.2718, pruned_loss=0.06203, over 21790.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.3002, pruned_loss=0.07136, over 4245410.05 frames. 
], batch size: 333, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:07:48,525 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2135646.0, ans=0.125 2023-06-28 20:08:47,967 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=2135826.0, ans=0.035 2023-06-28 20:09:27,003 INFO [train.py:996] (0/4) Epoch 12, batch 20550, loss[loss=0.1672, simple_loss=0.223, pruned_loss=0.05572, over 20883.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2922, pruned_loss=0.06986, over 4246723.70 frames. ], batch size: 608, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:09:27,524 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=2135946.0, ans=0.1 2023-06-28 20:09:32,111 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.925e+02 7.744e+02 1.015e+03 1.488e+03 3.056e+03, threshold=2.029e+03, percent-clipped=4.0 2023-06-28 20:09:40,544 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-356000.pt 2023-06-28 20:10:56,915 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=2136186.0, ans=0.0 2023-06-28 20:11:04,749 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2136186.0, ans=0.0 2023-06-28 20:11:07,031 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.63 vs. limit=15.0 2023-06-28 20:11:08,415 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.91 vs. limit=15.0 2023-06-28 20:11:10,575 INFO [train.py:996] (0/4) Epoch 12, batch 20600, loss[loss=0.2295, simple_loss=0.3017, pruned_loss=0.07859, over 21329.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2946, pruned_loss=0.0684, over 4250048.80 frames. ], batch size: 143, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:11:19,186 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 20:11:44,015 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.33 vs. limit=8.0 2023-06-28 20:11:57,813 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2136366.0, ans=0.1 2023-06-28 20:12:41,663 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=2136486.0, ans=0.2 2023-06-28 20:12:45,954 INFO [train.py:996] (0/4) Epoch 12, batch 20650, loss[loss=0.1809, simple_loss=0.2471, pruned_loss=0.05731, over 21639.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2912, pruned_loss=0.0685, over 4249598.82 frames. 
], batch size: 247, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:12:51,204 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.648e+02 9.695e+02 1.455e+03 2.228e+03 5.123e+03, threshold=2.910e+03, percent-clipped=30.0 2023-06-28 20:13:11,654 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2136606.0, ans=0.125 2023-06-28 20:13:55,830 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2136726.0, ans=0.125 2023-06-28 20:14:28,009 INFO [train.py:996] (0/4) Epoch 12, batch 20700, loss[loss=0.2108, simple_loss=0.3188, pruned_loss=0.05135, over 20742.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2842, pruned_loss=0.06573, over 4252607.68 frames. ], batch size: 608, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:14:35,125 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=2136846.0, ans=0.125 2023-06-28 20:15:02,384 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2136966.0, ans=0.125 2023-06-28 20:16:09,300 INFO [train.py:996] (0/4) Epoch 12, batch 20750, loss[loss=0.3031, simple_loss=0.3919, pruned_loss=0.1072, over 21575.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2883, pruned_loss=0.06514, over 4251787.21 frames. ], batch size: 471, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:16:14,420 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.341e+02 7.769e+02 1.310e+03 2.249e+03 6.727e+03, threshold=2.619e+03, percent-clipped=13.0 2023-06-28 20:17:14,394 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=2137326.0, ans=0.0 2023-06-28 20:17:29,465 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=2137326.0, ans=10.0 2023-06-28 20:17:51,061 INFO [train.py:996] (0/4) Epoch 12, batch 20800, loss[loss=0.1962, simple_loss=0.2644, pruned_loss=0.06406, over 20691.00 frames. ], tot_loss[loss=0.212, simple_loss=0.2915, pruned_loss=0.06626, over 4252025.62 frames. ], batch size: 607, lr: 2.40e-03, grad_scale: 32.0 2023-06-28 20:18:01,927 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2137446.0, ans=0.1 2023-06-28 20:19:05,822 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2137626.0, ans=0.04949747468305833 2023-06-28 20:19:12,594 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.76 vs. limit=22.5 2023-06-28 20:19:32,045 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=2137746.0, ans=0.0 2023-06-28 20:19:33,043 INFO [train.py:996] (0/4) Epoch 12, batch 20850, loss[loss=0.1985, simple_loss=0.2728, pruned_loss=0.06208, over 21823.00 frames. ], tot_loss[loss=0.2057, simple_loss=0.2835, pruned_loss=0.06394, over 4257183.89 frames. 
], batch size: 371, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:19:39,613 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.878e+02 7.517e+02 1.058e+03 1.433e+03 3.063e+03, threshold=2.117e+03, percent-clipped=2.0 2023-06-28 20:20:58,002 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=2137986.0, ans=0.2 2023-06-28 20:21:10,316 INFO [train.py:996] (0/4) Epoch 12, batch 20900, loss[loss=0.1998, simple_loss=0.2773, pruned_loss=0.06114, over 21253.00 frames. ], tot_loss[loss=0.207, simple_loss=0.284, pruned_loss=0.06503, over 4267147.39 frames. ], batch size: 176, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:21:31,501 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 20:21:48,781 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2138106.0, ans=0.0 2023-06-28 20:22:48,736 INFO [train.py:996] (0/4) Epoch 12, batch 20950, loss[loss=0.2015, simple_loss=0.277, pruned_loss=0.063, over 21796.00 frames. ], tot_loss[loss=0.203, simple_loss=0.2813, pruned_loss=0.06236, over 4256863.57 frames. ], batch size: 112, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:22:55,230 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.730e+02 8.164e+02 1.366e+03 2.074e+03 5.785e+03, threshold=2.733e+03, percent-clipped=24.0 2023-06-28 20:23:04,317 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=2138406.0, ans=0.0 2023-06-28 20:23:05,750 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2138406.0, ans=0.0 2023-06-28 20:23:49,746 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2138526.0, ans=0.125 2023-06-28 20:23:51,498 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=2138526.0, ans=0.2 2023-06-28 20:23:51,571 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=2138526.0, ans=0.2 2023-06-28 20:24:11,521 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=10.08 vs. limit=15.0 2023-06-28 20:24:21,569 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2138586.0, ans=0.1 2023-06-28 20:24:24,237 INFO [train.py:996] (0/4) Epoch 12, batch 21000, loss[loss=0.2098, simple_loss=0.2751, pruned_loss=0.07223, over 21567.00 frames. ], tot_loss[loss=0.2031, simple_loss=0.2802, pruned_loss=0.06306, over 4265340.28 frames. ], batch size: 548, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:24:24,238 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-28 20:24:40,719 INFO [train.py:1028] (0/4) Epoch 12, validation: loss=0.2646, simple_loss=0.357, pruned_loss=0.08608, over 1796401.00 frames. 2023-06-28 20:24:40,721 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23714MB 2023-06-28 20:24:43,976 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.72 vs. 
limit=15.0 2023-06-28 20:24:53,219 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2138646.0, ans=0.125 2023-06-28 20:25:14,889 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.93 vs. limit=10.0 2023-06-28 20:26:07,304 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=2138886.0, ans=0.125 2023-06-28 20:26:21,498 INFO [train.py:996] (0/4) Epoch 12, batch 21050, loss[loss=0.1812, simple_loss=0.2494, pruned_loss=0.05653, over 21757.00 frames. ], tot_loss[loss=0.2018, simple_loss=0.278, pruned_loss=0.06283, over 4270575.09 frames. ], batch size: 351, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:26:28,202 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.708e+02 6.795e+02 9.340e+02 1.308e+03 3.165e+03, threshold=1.868e+03, percent-clipped=2.0 2023-06-28 20:26:28,727 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2138946.0, ans=0.125 2023-06-28 20:27:26,521 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2139126.0, ans=0.0 2023-06-28 20:28:01,129 INFO [train.py:996] (0/4) Epoch 12, batch 21100, loss[loss=0.1808, simple_loss=0.2487, pruned_loss=0.05647, over 21321.00 frames. ], tot_loss[loss=0.2003, simple_loss=0.2752, pruned_loss=0.06271, over 4272736.18 frames. ], batch size: 177, lr: 2.40e-03, grad_scale: 8.0 2023-06-28 20:28:09,675 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=2139246.0, ans=0.125 2023-06-28 20:28:37,388 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2139306.0, ans=0.09899494936611666 2023-06-28 20:28:43,874 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 20:28:57,562 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=2139366.0, ans=15.0 2023-06-28 20:29:18,816 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.45 vs. limit=15.0 2023-06-28 20:29:19,785 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=2139486.0, ans=0.2 2023-06-28 20:29:42,284 INFO [train.py:996] (0/4) Epoch 12, batch 21150, loss[loss=0.174, simple_loss=0.2263, pruned_loss=0.06087, over 20801.00 frames. ], tot_loss[loss=0.1973, simple_loss=0.2698, pruned_loss=0.0624, over 4264819.31 frames. ], batch size: 608, lr: 2.40e-03, grad_scale: 8.0 2023-06-28 20:29:50,627 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.872e+02 8.259e+02 1.205e+03 1.749e+03 3.220e+03, threshold=2.410e+03, percent-clipped=20.0 2023-06-28 20:30:45,045 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2139726.0, ans=0.125 2023-06-28 20:30:55,538 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.41 vs. 
limit=15.0 2023-06-28 20:30:58,252 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2139726.0, ans=0.1 2023-06-28 20:31:05,401 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.20 vs. limit=12.0 2023-06-28 20:31:23,260 INFO [train.py:996] (0/4) Epoch 12, batch 21200, loss[loss=0.1892, simple_loss=0.2557, pruned_loss=0.06141, over 21305.00 frames. ], tot_loss[loss=0.1946, simple_loss=0.2662, pruned_loss=0.06152, over 4263869.81 frames. ], batch size: 176, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:31:44,843 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=2139906.0, ans=0.125 2023-06-28 20:32:46,897 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2140086.0, ans=0.1 2023-06-28 20:32:48,536 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2140086.0, ans=0.125 2023-06-28 20:33:04,761 INFO [train.py:996] (0/4) Epoch 12, batch 21250, loss[loss=0.1783, simple_loss=0.2422, pruned_loss=0.05721, over 21333.00 frames. ], tot_loss[loss=0.1942, simple_loss=0.265, pruned_loss=0.0617, over 4249464.43 frames. ], batch size: 211, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:33:13,142 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.761e+02 7.355e+02 9.747e+02 1.370e+03 2.666e+03, threshold=1.949e+03, percent-clipped=4.0 2023-06-28 20:33:15,317 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2140146.0, ans=0.1 2023-06-28 20:33:28,760 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=2140206.0, ans=0.125 2023-06-28 20:33:43,481 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2140206.0, ans=0.2 2023-06-28 20:33:52,017 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.15 vs. limit=15.0 2023-06-28 20:34:01,763 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2140266.0, ans=0.125 2023-06-28 20:34:08,186 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=2140326.0, ans=0.125 2023-06-28 20:34:47,112 INFO [train.py:996] (0/4) Epoch 12, batch 21300, loss[loss=0.2086, simple_loss=0.2889, pruned_loss=0.0642, over 21909.00 frames. ], tot_loss[loss=0.2011, simple_loss=0.2731, pruned_loss=0.06459, over 4255065.57 frames. 
], batch size: 316, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:34:49,598 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=2140446.0, ans=0.95 2023-06-28 20:34:54,394 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=2140446.0, ans=0.035 2023-06-28 20:35:01,447 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=2140446.0, ans=0.2 2023-06-28 20:35:35,356 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2140566.0, ans=0.0 2023-06-28 20:35:35,438 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=2140566.0, ans=0.2 2023-06-28 20:35:48,691 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2140566.0, ans=0.125 2023-06-28 20:36:04,064 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 20:36:29,986 INFO [train.py:996] (0/4) Epoch 12, batch 21350, loss[loss=0.1758, simple_loss=0.2731, pruned_loss=0.03924, over 21835.00 frames. ], tot_loss[loss=0.2036, simple_loss=0.2772, pruned_loss=0.06495, over 4267476.44 frames. ], batch size: 333, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:36:38,305 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.98 vs. limit=15.0 2023-06-28 20:36:43,154 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.041e+02 8.389e+02 1.153e+03 1.810e+03 4.461e+03, threshold=2.306e+03, percent-clipped=20.0 2023-06-28 20:36:51,761 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2140746.0, ans=0.125 2023-06-28 20:37:25,600 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.21 vs. limit=15.0 2023-06-28 20:37:45,805 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.83 vs. limit=15.0 2023-06-28 20:38:16,935 INFO [train.py:996] (0/4) Epoch 12, batch 21400, loss[loss=0.1987, simple_loss=0.2827, pruned_loss=0.05738, over 21788.00 frames. ], tot_loss[loss=0.2053, simple_loss=0.2812, pruned_loss=0.06469, over 4274016.73 frames. ], batch size: 282, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:38:35,202 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=2141046.0, ans=0.125 2023-06-28 20:39:33,678 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2141226.0, ans=0.125 2023-06-28 20:39:48,496 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=2141286.0, ans=0.2 2023-06-28 20:39:57,095 INFO [train.py:996] (0/4) Epoch 12, batch 21450, loss[loss=0.2136, simple_loss=0.2871, pruned_loss=0.07006, over 21553.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2844, pruned_loss=0.06534, over 4282501.09 frames. 
], batch size: 131, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:40:04,992 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.076e+02 7.437e+02 1.005e+03 1.722e+03 2.921e+03, threshold=2.009e+03, percent-clipped=6.0 2023-06-28 20:40:13,840 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=2141346.0, ans=0.0 2023-06-28 20:40:28,389 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=2141406.0, ans=0.09899494936611666 2023-06-28 20:41:02,587 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=2141526.0, ans=0.125 2023-06-28 20:41:38,446 INFO [train.py:996] (0/4) Epoch 12, batch 21500, loss[loss=0.2097, simple_loss=0.2714, pruned_loss=0.07402, over 21735.00 frames. ], tot_loss[loss=0.2071, simple_loss=0.2821, pruned_loss=0.06609, over 4286562.32 frames. ], batch size: 316, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:41:48,823 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=2141646.0, ans=0.04949747468305833 2023-06-28 20:41:52,353 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2141646.0, ans=0.0 2023-06-28 20:42:05,948 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 20:43:19,795 INFO [train.py:996] (0/4) Epoch 12, batch 21550, loss[loss=0.1495, simple_loss=0.2259, pruned_loss=0.03654, over 21644.00 frames. ], tot_loss[loss=0.2008, simple_loss=0.2745, pruned_loss=0.06356, over 4278890.13 frames. ], batch size: 263, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:43:32,809 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.830e+02 7.462e+02 9.978e+02 1.500e+03 2.892e+03, threshold=1.996e+03, percent-clipped=12.0 2023-06-28 20:43:33,367 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=2141946.0, ans=0.125 2023-06-28 20:43:56,434 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2142006.0, ans=0.0 2023-06-28 20:44:18,421 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=2142126.0, ans=0.125 2023-06-28 20:44:46,026 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.00 vs. limit=12.0 2023-06-28 20:44:59,464 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=2142186.0, ans=0.0 2023-06-28 20:45:03,643 INFO [train.py:996] (0/4) Epoch 12, batch 21600, loss[loss=0.1734, simple_loss=0.2456, pruned_loss=0.05058, over 21632.00 frames. ], tot_loss[loss=0.1967, simple_loss=0.2699, pruned_loss=0.06175, over 4278003.30 frames. 
], batch size: 264, lr: 2.40e-03, grad_scale: 32.0 2023-06-28 20:45:19,766 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=2142246.0, ans=0.0 2023-06-28 20:45:50,422 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=2142366.0, ans=0.07 2023-06-28 20:45:59,823 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2142366.0, ans=0.125 2023-06-28 20:46:49,158 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2142486.0, ans=0.125 2023-06-28 20:46:51,890 INFO [train.py:996] (0/4) Epoch 12, batch 21650, loss[loss=0.1904, simple_loss=0.2835, pruned_loss=0.04861, over 21263.00 frames. ], tot_loss[loss=0.1993, simple_loss=0.2772, pruned_loss=0.06068, over 4274531.75 frames. ], batch size: 176, lr: 2.40e-03, grad_scale: 8.0 2023-06-28 20:47:03,117 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.132e+02 8.434e+02 1.336e+03 2.286e+03 3.969e+03, threshold=2.673e+03, percent-clipped=30.0 2023-06-28 20:47:41,005 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=2142666.0, ans=0.035 2023-06-28 20:48:08,412 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=2142786.0, ans=0.07 2023-06-28 20:48:26,749 INFO [train.py:996] (0/4) Epoch 12, batch 21700, loss[loss=0.1897, simple_loss=0.2836, pruned_loss=0.0479, over 21593.00 frames. ], tot_loss[loss=0.1986, simple_loss=0.278, pruned_loss=0.05963, over 4266578.30 frames. ], batch size: 230, lr: 2.40e-03, grad_scale: 8.0 2023-06-28 20:48:32,555 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2142846.0, ans=0.125 2023-06-28 20:48:37,764 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.58 vs. limit=22.5 2023-06-28 20:49:44,553 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=2143026.0, ans=0.125 2023-06-28 20:49:50,883 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2143086.0, ans=0.1 2023-06-28 20:49:53,068 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.13 vs. limit=10.0 2023-06-28 20:50:06,737 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=2143146.0, ans=0.0 2023-06-28 20:50:07,610 INFO [train.py:996] (0/4) Epoch 12, batch 21750, loss[loss=0.1667, simple_loss=0.2219, pruned_loss=0.05579, over 20782.00 frames. ], tot_loss[loss=0.1965, simple_loss=0.2731, pruned_loss=0.05997, over 4275051.36 frames. ], batch size: 608, lr: 2.40e-03, grad_scale: 8.0 2023-06-28 20:50:24,253 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.270e+02 7.010e+02 1.001e+03 1.482e+03 3.293e+03, threshold=2.002e+03, percent-clipped=2.0 2023-06-28 20:50:41,630 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.84 vs. 
limit=15.0 2023-06-28 20:50:57,061 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=2143266.0, ans=0.07 2023-06-28 20:51:08,315 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2143266.0, ans=0.09899494936611666 2023-06-28 20:51:54,918 INFO [train.py:996] (0/4) Epoch 12, batch 21800, loss[loss=0.263, simple_loss=0.3371, pruned_loss=0.09444, over 21450.00 frames. ], tot_loss[loss=0.1958, simple_loss=0.27, pruned_loss=0.06081, over 4273091.21 frames. ], batch size: 473, lr: 2.40e-03, grad_scale: 8.0 2023-06-28 20:52:45,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2143566.0, ans=0.125 2023-06-28 20:53:18,747 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.80 vs. limit=12.0 2023-06-28 20:53:21,463 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=2143686.0, ans=0.125 2023-06-28 20:53:37,021 INFO [train.py:996] (0/4) Epoch 12, batch 21850, loss[loss=0.1983, simple_loss=0.278, pruned_loss=0.05928, over 21804.00 frames. ], tot_loss[loss=0.1998, simple_loss=0.2768, pruned_loss=0.06134, over 4279742.13 frames. ], batch size: 282, lr: 2.40e-03, grad_scale: 8.0 2023-06-28 20:53:41,080 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2143746.0, ans=0.125 2023-06-28 20:53:48,649 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.108e+02 8.276e+02 1.227e+03 1.863e+03 4.037e+03, threshold=2.455e+03, percent-clipped=20.0 2023-06-28 20:54:02,261 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.31 vs. limit=22.5 2023-06-28 20:55:18,295 INFO [train.py:996] (0/4) Epoch 12, batch 21900, loss[loss=0.2038, simple_loss=0.275, pruned_loss=0.06633, over 21552.00 frames. ], tot_loss[loss=0.2001, simple_loss=0.2764, pruned_loss=0.06186, over 4262673.89 frames. ], batch size: 548, lr: 2.40e-03, grad_scale: 8.0 2023-06-28 20:55:28,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=2144046.0, ans=0.125 2023-06-28 20:55:41,260 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2144106.0, ans=0.125 2023-06-28 20:56:17,083 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2144226.0, ans=0.125 2023-06-28 20:56:27,921 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2144226.0, ans=0.1 2023-06-28 20:56:38,003 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2144286.0, ans=0.1 2023-06-28 20:56:58,123 INFO [train.py:996] (0/4) Epoch 12, batch 21950, loss[loss=0.1524, simple_loss=0.2423, pruned_loss=0.03128, over 21198.00 frames. ], tot_loss[loss=0.1961, simple_loss=0.2708, pruned_loss=0.0607, over 4252978.69 frames. 
], batch size: 548, lr: 2.40e-03, grad_scale: 8.0 2023-06-28 20:57:08,648 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2144346.0, ans=0.1 2023-06-28 20:57:09,571 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.558e+02 7.761e+02 1.147e+03 1.869e+03 4.092e+03, threshold=2.294e+03, percent-clipped=9.0 2023-06-28 20:57:25,215 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.04 vs. limit=22.5 2023-06-28 20:57:31,923 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.33 vs. limit=6.0 2023-06-28 20:57:50,350 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=2144466.0, ans=0.0 2023-06-28 20:57:55,785 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2144526.0, ans=0.125 2023-06-28 20:58:20,788 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2144586.0, ans=0.1 2023-06-28 20:58:32,273 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=2144586.0, ans=0.0 2023-06-28 20:58:36,629 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.54 vs. limit=6.0 2023-06-28 20:58:40,416 INFO [train.py:996] (0/4) Epoch 12, batch 22000, loss[loss=0.1943, simple_loss=0.2715, pruned_loss=0.05854, over 21723.00 frames. ], tot_loss[loss=0.1911, simple_loss=0.2658, pruned_loss=0.05818, over 4257424.92 frames. ], batch size: 333, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:59:15,463 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2144706.0, ans=0.125 2023-06-28 20:59:22,293 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2144766.0, ans=0.0 2023-06-28 20:59:30,299 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2144766.0, ans=0.125 2023-06-28 20:59:32,415 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.38 vs. limit=15.0 2023-06-28 20:59:39,944 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.69 vs. limit=15.0 2023-06-28 21:00:09,015 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.66 vs. limit=15.0 2023-06-28 21:00:16,445 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2144886.0, ans=0.125 2023-06-28 21:00:23,764 INFO [train.py:996] (0/4) Epoch 12, batch 22050, loss[loss=0.2376, simple_loss=0.3182, pruned_loss=0.07852, over 21622.00 frames. ], tot_loss[loss=0.1957, simple_loss=0.2713, pruned_loss=0.06003, over 4258389.32 frames. 
], batch size: 230, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 21:00:40,614 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.182e+02 7.125e+02 1.182e+03 1.630e+03 4.961e+03, threshold=2.364e+03, percent-clipped=13.0 2023-06-28 21:00:42,948 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2144946.0, ans=0.2 2023-06-28 21:01:40,961 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.65 vs. limit=15.0 2023-06-28 21:02:06,218 INFO [train.py:996] (0/4) Epoch 12, batch 22100, loss[loss=0.2118, simple_loss=0.2947, pruned_loss=0.06449, over 21652.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.2821, pruned_loss=0.06484, over 4258232.30 frames. ], batch size: 230, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 21:02:21,769 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2145246.0, ans=0.125 2023-06-28 21:02:22,332 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.26 vs. limit=15.0 2023-06-28 21:02:55,328 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.88 vs. limit=6.0 2023-06-28 21:03:01,184 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2145366.0, ans=0.0 2023-06-28 21:03:46,998 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=2145546.0, ans=10.0 2023-06-28 21:03:47,949 INFO [train.py:996] (0/4) Epoch 12, batch 22150, loss[loss=0.2326, simple_loss=0.3007, pruned_loss=0.08225, over 21791.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2833, pruned_loss=0.06556, over 4269182.21 frames. ], batch size: 441, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 21:04:04,052 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.507e+02 8.832e+02 1.298e+03 1.809e+03 3.590e+03, threshold=2.596e+03, percent-clipped=11.0 2023-06-28 21:04:19,908 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.80 vs. limit=10.0 2023-06-28 21:04:35,093 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.32 vs. limit=15.0 2023-06-28 21:05:29,498 INFO [train.py:996] (0/4) Epoch 12, batch 22200, loss[loss=0.2222, simple_loss=0.2878, pruned_loss=0.07832, over 22030.00 frames. ], tot_loss[loss=0.209, simple_loss=0.285, pruned_loss=0.0665, over 4277417.66 frames. ], batch size: 416, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 21:06:20,280 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2145966.0, ans=0.2 2023-06-28 21:06:22,491 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn2.whiten.whitening_limit, batch_count=2145966.0, ans=22.5 2023-06-28 21:07:16,285 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2146146.0, ans=0.09899494936611666 2023-06-28 21:07:17,287 INFO [train.py:996] (0/4) Epoch 12, batch 22250, loss[loss=0.251, simple_loss=0.3271, pruned_loss=0.08747, over 21399.00 frames. 
], tot_loss[loss=0.2142, simple_loss=0.2922, pruned_loss=0.06809, over 4281410.28 frames. ], batch size: 159, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:07:29,283 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.277e+02 8.206e+02 1.186e+03 1.604e+03 3.301e+03, threshold=2.372e+03, percent-clipped=3.0 2023-06-28 21:07:29,842 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2146146.0, ans=0.125 2023-06-28 21:07:47,929 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.99 vs. limit=15.0 2023-06-28 21:07:56,925 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=2146206.0, ans=0.95 2023-06-28 21:08:57,748 INFO [train.py:996] (0/4) Epoch 12, batch 22300, loss[loss=0.21, simple_loss=0.2797, pruned_loss=0.07013, over 21531.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2939, pruned_loss=0.07026, over 4291739.96 frames. ], batch size: 548, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:09:11,447 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2146446.0, ans=0.125 2023-06-28 21:09:26,168 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2146506.0, ans=0.125 2023-06-28 21:09:32,262 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2146506.0, ans=0.125 2023-06-28 21:10:12,769 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2146626.0, ans=0.1 2023-06-28 21:10:34,396 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=2146686.0, ans=0.125 2023-06-28 21:10:38,565 INFO [train.py:996] (0/4) Epoch 12, batch 22350, loss[loss=0.2028, simple_loss=0.2817, pruned_loss=0.062, over 21864.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2907, pruned_loss=0.07043, over 4302309.46 frames. ], batch size: 351, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:10:50,160 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.715e+02 7.662e+02 1.007e+03 1.656e+03 3.932e+03, threshold=2.013e+03, percent-clipped=14.0 2023-06-28 21:11:23,525 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=2146866.0, ans=0.2 2023-06-28 21:11:58,225 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2146986.0, ans=0.125 2023-06-28 21:12:14,442 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2146986.0, ans=0.125 2023-06-28 21:12:20,278 INFO [train.py:996] (0/4) Epoch 12, batch 22400, loss[loss=0.1888, simple_loss=0.2669, pruned_loss=0.05539, over 21760.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2873, pruned_loss=0.06706, over 4299989.92 frames. 
], batch size: 118, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:13:22,166 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=2147226.0, ans=0.125 2023-06-28 21:13:39,968 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2147286.0, ans=0.125 2023-06-28 21:13:40,720 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.56 vs. limit=15.0 2023-06-28 21:14:05,224 INFO [train.py:996] (0/4) Epoch 12, batch 22450, loss[loss=0.1612, simple_loss=0.209, pruned_loss=0.05667, over 20774.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2816, pruned_loss=0.06642, over 4286668.51 frames. ], batch size: 608, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:14:06,731 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.48 vs. limit=6.0 2023-06-28 21:14:18,834 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.767e+02 6.974e+02 9.708e+02 1.486e+03 4.519e+03, threshold=1.942e+03, percent-clipped=14.0 2023-06-28 21:15:48,358 INFO [train.py:996] (0/4) Epoch 12, batch 22500, loss[loss=0.2644, simple_loss=0.3489, pruned_loss=0.08996, over 21574.00 frames. ], tot_loss[loss=0.2054, simple_loss=0.2783, pruned_loss=0.06625, over 4274460.45 frames. ], batch size: 441, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:16:23,529 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=2147706.0, ans=0.2 2023-06-28 21:16:41,866 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2147766.0, ans=0.125 2023-06-28 21:17:05,705 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.98 vs. limit=10.0 2023-06-28 21:17:31,319 INFO [train.py:996] (0/4) Epoch 12, batch 22550, loss[loss=0.2082, simple_loss=0.2885, pruned_loss=0.06394, over 21909.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.2844, pruned_loss=0.06644, over 4281750.53 frames. ], batch size: 414, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:17:49,757 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.593e+02 9.385e+02 1.394e+03 1.973e+03 3.224e+03, threshold=2.788e+03, percent-clipped=25.0 2023-06-28 21:17:59,276 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2148006.0, ans=0.125 2023-06-28 21:17:59,298 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2148006.0, ans=0.125 2023-06-28 21:18:14,565 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2148066.0, ans=0.125 2023-06-28 21:18:25,811 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.11 vs. 
limit=12.0 2023-06-28 21:18:33,264 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2148066.0, ans=0.09899494936611666 2023-06-28 21:18:38,342 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 21:18:51,562 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=2148126.0, ans=0.0 2023-06-28 21:19:20,503 INFO [train.py:996] (0/4) Epoch 12, batch 22600, loss[loss=0.2268, simple_loss=0.319, pruned_loss=0.06727, over 21681.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2871, pruned_loss=0.06709, over 4287745.86 frames. ], batch size: 389, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:19:39,735 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=2148306.0, ans=0.0 2023-06-28 21:20:08,962 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=2148366.0, ans=0.0 2023-06-28 21:20:36,622 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=2148426.0, ans=0.125 2023-06-28 21:20:44,932 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2148486.0, ans=0.125 2023-06-28 21:20:53,224 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.75 vs. limit=15.0 2023-06-28 21:21:01,900 INFO [train.py:996] (0/4) Epoch 12, batch 22650, loss[loss=0.2204, simple_loss=0.3322, pruned_loss=0.05431, over 19774.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.2836, pruned_loss=0.06685, over 4281502.81 frames. ], batch size: 703, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:21:14,881 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.226e+02 9.650e+02 1.395e+03 1.973e+03 4.081e+03, threshold=2.791e+03, percent-clipped=9.0 2023-06-28 21:21:26,864 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2148606.0, ans=0.125 2023-06-28 21:22:10,374 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=2148726.0, ans=0.2 2023-06-28 21:22:41,736 INFO [train.py:996] (0/4) Epoch 12, batch 22700, loss[loss=0.2172, simple_loss=0.276, pruned_loss=0.07915, over 21536.00 frames. ], tot_loss[loss=0.2042, simple_loss=0.277, pruned_loss=0.06569, over 4276746.25 frames. ], batch size: 441, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:23:52,109 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=2149026.0, ans=0.125 2023-06-28 21:24:11,764 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2149086.0, ans=0.0 2023-06-28 21:24:23,337 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=2149146.0, ans=0.0 2023-06-28 21:24:24,382 INFO [train.py:996] (0/4) Epoch 12, batch 22750, loss[loss=0.2342, simple_loss=0.3072, pruned_loss=0.08055, over 21952.00 frames. ], tot_loss[loss=0.2064, simple_loss=0.2783, pruned_loss=0.06727, over 4283025.37 frames. 
], batch size: 372, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:24:35,171 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2149146.0, ans=0.1 2023-06-28 21:24:37,898 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.797e+02 7.718e+02 1.201e+03 1.681e+03 3.626e+03, threshold=2.402e+03, percent-clipped=4.0 2023-06-28 21:25:30,839 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=2149326.0, ans=0.5 2023-06-28 21:26:05,754 INFO [train.py:996] (0/4) Epoch 12, batch 22800, loss[loss=0.2045, simple_loss=0.2816, pruned_loss=0.06366, over 21897.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.283, pruned_loss=0.06895, over 4279784.44 frames. ], batch size: 316, lr: 2.39e-03, grad_scale: 32.0 2023-06-28 21:26:14,935 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2149446.0, ans=0.0 2023-06-28 21:26:15,592 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.85 vs. limit=15.0 2023-06-28 21:26:56,410 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=2149566.0, ans=0.125 2023-06-28 21:27:30,372 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.17 vs. limit=15.0 2023-06-28 21:27:31,599 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=2149686.0, ans=0.2 2023-06-28 21:27:39,036 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.30 vs. limit=15.0 2023-06-28 21:27:40,406 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.28 vs. limit=12.0 2023-06-28 21:27:45,950 INFO [train.py:996] (0/4) Epoch 12, batch 22850, loss[loss=0.1928, simple_loss=0.2542, pruned_loss=0.0657, over 21546.00 frames. ], tot_loss[loss=0.208, simple_loss=0.2797, pruned_loss=0.06817, over 4273538.25 frames. ], batch size: 414, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:27:47,351 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.68 vs. limit=15.0 2023-06-28 21:28:01,321 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.804e+02 7.642e+02 1.050e+03 1.882e+03 3.484e+03, threshold=2.099e+03, percent-clipped=13.0 2023-06-28 21:28:12,080 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2149806.0, ans=0.125 2023-06-28 21:29:17,145 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2149986.0, ans=0.1 2023-06-28 21:29:30,153 INFO [train.py:996] (0/4) Epoch 12, batch 22900, loss[loss=0.2013, simple_loss=0.3149, pruned_loss=0.0438, over 21690.00 frames. ], tot_loss[loss=0.2085, simple_loss=0.282, pruned_loss=0.0675, over 4271322.47 frames. 
], batch size: 298, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:29:55,800 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2150106.0, ans=0.125 2023-06-28 21:30:16,263 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2150166.0, ans=0.1 2023-06-28 21:31:19,872 INFO [train.py:996] (0/4) Epoch 12, batch 22950, loss[loss=0.1955, simple_loss=0.2484, pruned_loss=0.07132, over 20356.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2936, pruned_loss=0.06619, over 4267841.69 frames. ], batch size: 703, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:31:25,486 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=2150346.0, ans=0.2 2023-06-28 21:31:39,642 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.589e+02 9.756e+02 1.509e+03 2.315e+03 4.900e+03, threshold=3.017e+03, percent-clipped=30.0 2023-06-28 21:31:40,511 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=2150406.0, ans=0.125 2023-06-28 21:33:02,909 INFO [train.py:996] (0/4) Epoch 12, batch 23000, loss[loss=0.213, simple_loss=0.2917, pruned_loss=0.06709, over 21910.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.2924, pruned_loss=0.06408, over 4275888.32 frames. ], batch size: 333, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:33:05,300 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2150646.0, ans=0.0 2023-06-28 21:33:22,768 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.13 vs. limit=6.0 2023-06-28 21:33:45,369 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2150766.0, ans=0.2 2023-06-28 21:33:53,224 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2150766.0, ans=0.1 2023-06-28 21:34:12,884 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=2150826.0, ans=0.0 2023-06-28 21:34:33,516 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2150886.0, ans=0.125 2023-06-28 21:34:38,487 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2150886.0, ans=0.0 2023-06-28 21:34:51,527 INFO [train.py:996] (0/4) Epoch 12, batch 23050, loss[loss=0.2266, simple_loss=0.3025, pruned_loss=0.07531, over 21444.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2935, pruned_loss=0.06578, over 4275550.32 frames. ], batch size: 194, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:35:10,952 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.719e+02 9.558e+02 1.419e+03 1.890e+03 3.669e+03, threshold=2.838e+03, percent-clipped=6.0 2023-06-28 21:35:11,636 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff2.min_abs, batch_count=2151006.0, ans=0.1 2023-06-28 21:35:51,055 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.49 vs. 
limit=15.0 2023-06-28 21:36:34,600 INFO [train.py:996] (0/4) Epoch 12, batch 23100, loss[loss=0.1706, simple_loss=0.2395, pruned_loss=0.05084, over 21787.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2901, pruned_loss=0.06699, over 4275307.89 frames. ], batch size: 317, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:36:56,600 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2151306.0, ans=0.125 2023-06-28 21:38:16,277 INFO [train.py:996] (0/4) Epoch 12, batch 23150, loss[loss=0.1609, simple_loss=0.2344, pruned_loss=0.04375, over 21519.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.2846, pruned_loss=0.06629, over 4283151.05 frames. ], batch size: 230, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:38:21,617 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2151546.0, ans=0.125 2023-06-28 21:38:30,883 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.983e+02 7.198e+02 1.006e+03 1.345e+03 2.860e+03, threshold=2.012e+03, percent-clipped=2.0 2023-06-28 21:38:33,767 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.24 vs. limit=22.5 2023-06-28 21:38:46,585 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.82 vs. limit=15.0 2023-06-28 21:39:02,522 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=5.47 vs. limit=22.5 2023-06-28 21:39:57,505 INFO [train.py:996] (0/4) Epoch 12, batch 23200, loss[loss=0.1822, simple_loss=0.2571, pruned_loss=0.05364, over 21825.00 frames. ], tot_loss[loss=0.208, simple_loss=0.2828, pruned_loss=0.06657, over 4285925.22 frames. ], batch size: 282, lr: 2.39e-03, grad_scale: 32.0 2023-06-28 21:40:33,481 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=2151966.0, ans=0.0 2023-06-28 21:40:50,085 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.43 vs. limit=22.5 2023-06-28 21:41:12,576 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=2152086.0, ans=0.0 2023-06-28 21:41:38,929 INFO [train.py:996] (0/4) Epoch 12, batch 23250, loss[loss=0.2734, simple_loss=0.3242, pruned_loss=0.1114, over 21640.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2818, pruned_loss=0.06668, over 4286086.22 frames. ], batch size: 507, lr: 2.39e-03, grad_scale: 32.0 2023-06-28 21:41:58,598 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.316e+02 9.370e+02 1.450e+03 2.114e+03 3.490e+03, threshold=2.900e+03, percent-clipped=30.0 2023-06-28 21:42:09,196 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=2152206.0, ans=0.125 2023-06-28 21:42:26,481 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.89 vs. 
limit=15.0 2023-06-28 21:42:31,083 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2152266.0, ans=0.2 2023-06-28 21:42:43,174 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=2152326.0, ans=0.125 2023-06-28 21:42:50,204 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.48 vs. limit=8.0 2023-06-28 21:43:17,366 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2152386.0, ans=0.125 2023-06-28 21:43:22,223 INFO [train.py:996] (0/4) Epoch 12, batch 23300, loss[loss=0.2437, simple_loss=0.3491, pruned_loss=0.06919, over 21676.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2892, pruned_loss=0.06802, over 4290973.25 frames. ], batch size: 389, lr: 2.39e-03, grad_scale: 32.0 2023-06-28 21:43:34,169 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2152446.0, ans=0.125 2023-06-28 21:43:57,829 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.79 vs. limit=10.0 2023-06-28 21:45:09,878 INFO [train.py:996] (0/4) Epoch 12, batch 23350, loss[loss=0.1588, simple_loss=0.237, pruned_loss=0.0403, over 21753.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.292, pruned_loss=0.06638, over 4288778.05 frames. ], batch size: 282, lr: 2.39e-03, grad_scale: 8.0 2023-06-28 21:45:18,856 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2152746.0, ans=0.1 2023-06-28 21:45:25,374 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2152746.0, ans=0.04949747468305833 2023-06-28 21:45:33,278 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.157e+02 1.010e+03 1.481e+03 2.093e+03 4.806e+03, threshold=2.962e+03, percent-clipped=5.0 2023-06-28 21:46:14,491 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=2152926.0, ans=0.0 2023-06-28 21:46:51,355 INFO [train.py:996] (0/4) Epoch 12, batch 23400, loss[loss=0.1828, simple_loss=0.2817, pruned_loss=0.04197, over 20779.00 frames. ], tot_loss[loss=0.207, simple_loss=0.2867, pruned_loss=0.06364, over 4281316.14 frames. ], batch size: 607, lr: 2.39e-03, grad_scale: 8.0 2023-06-28 21:48:23,035 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.49 vs. limit=15.0 2023-06-28 21:48:38,219 INFO [train.py:996] (0/4) Epoch 12, batch 23450, loss[loss=0.2082, simple_loss=0.293, pruned_loss=0.06169, over 21894.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2886, pruned_loss=0.06648, over 4280480.72 frames. ], batch size: 371, lr: 2.39e-03, grad_scale: 8.0 2023-06-28 21:48:46,089 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.62 vs. 
limit=15.0 2023-06-28 21:48:56,418 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.004e+02 7.180e+02 1.083e+03 1.740e+03 4.594e+03, threshold=2.165e+03, percent-clipped=4.0 2023-06-28 21:48:58,726 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2153406.0, ans=0.1 2023-06-28 21:49:22,077 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2153466.0, ans=0.125 2023-06-28 21:49:51,322 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=2153526.0, ans=0.125 2023-06-28 21:50:09,369 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.10 vs. limit=15.0 2023-06-28 21:50:19,199 INFO [train.py:996] (0/4) Epoch 12, batch 23500, loss[loss=0.1817, simple_loss=0.272, pruned_loss=0.0457, over 19926.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2867, pruned_loss=0.06727, over 4275449.04 frames. ], batch size: 702, lr: 2.39e-03, grad_scale: 8.0 2023-06-28 21:51:45,054 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2153886.0, ans=0.125 2023-06-28 21:51:46,756 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2153886.0, ans=0.0 2023-06-28 21:51:56,109 INFO [train.py:996] (0/4) Epoch 12, batch 23550, loss[loss=0.1738, simple_loss=0.2224, pruned_loss=0.0626, over 20709.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.2824, pruned_loss=0.06712, over 4276472.26 frames. ], batch size: 607, lr: 2.39e-03, grad_scale: 8.0 2023-06-28 21:52:11,290 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2153946.0, ans=0.125 2023-06-28 21:52:18,944 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.122e+02 7.386e+02 1.223e+03 1.985e+03 5.110e+03, threshold=2.446e+03, percent-clipped=21.0 2023-06-28 21:52:22,914 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=2154006.0, ans=0.0 2023-06-28 21:52:31,180 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=2154006.0, ans=0.125 2023-06-28 21:52:51,731 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.95 vs. limit=22.5 2023-06-28 21:53:10,236 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.16 vs. limit=6.0 2023-06-28 21:53:11,777 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.23 vs. limit=12.0 2023-06-28 21:53:43,336 INFO [train.py:996] (0/4) Epoch 12, batch 23600, loss[loss=0.2296, simple_loss=0.3074, pruned_loss=0.07592, over 21574.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2839, pruned_loss=0.06723, over 4269851.95 frames. 
], batch size: 415, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:53:43,845 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2154246.0, ans=0.125 2023-06-28 21:53:52,926 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.89 vs. limit=15.0 2023-06-28 21:55:18,568 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=2154486.0, ans=0.0 2023-06-28 21:55:23,648 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2154486.0, ans=0.0 2023-06-28 21:55:26,537 INFO [train.py:996] (0/4) Epoch 12, batch 23650, loss[loss=0.2052, simple_loss=0.2904, pruned_loss=0.06005, over 21758.00 frames. ], tot_loss[loss=0.2085, simple_loss=0.285, pruned_loss=0.06596, over 4267792.44 frames. ], batch size: 247, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:55:34,051 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2154546.0, ans=0.2 2023-06-28 21:55:50,248 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.631e+02 9.498e+02 1.627e+03 2.545e+03 5.743e+03, threshold=3.254e+03, percent-clipped=28.0 2023-06-28 21:57:10,351 INFO [train.py:996] (0/4) Epoch 12, batch 23700, loss[loss=0.1996, simple_loss=0.2825, pruned_loss=0.05832, over 20673.00 frames. ], tot_loss[loss=0.2091, simple_loss=0.2874, pruned_loss=0.06541, over 4271744.74 frames. ], batch size: 607, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:57:23,032 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.84 vs. limit=15.0 2023-06-28 21:58:03,618 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2154966.0, ans=0.125 2023-06-28 21:58:05,679 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=2154966.0, ans=0.025 2023-06-28 21:58:23,003 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=2155026.0, ans=15.0 2023-06-28 21:58:23,160 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.46 vs. limit=15.0 2023-06-28 21:58:44,595 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2155086.0, ans=0.125 2023-06-28 21:58:58,959 INFO [train.py:996] (0/4) Epoch 12, batch 23750, loss[loss=0.2613, simple_loss=0.3367, pruned_loss=0.09294, over 21843.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2916, pruned_loss=0.06668, over 4278462.88 frames. 
], batch size: 118, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:59:05,913 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2155146.0, ans=0.125 2023-06-28 21:59:16,038 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=2155146.0, ans=0.2 2023-06-28 21:59:21,758 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.921e+02 7.434e+02 9.463e+02 1.338e+03 4.159e+03, threshold=1.893e+03, percent-clipped=3.0 2023-06-28 22:00:03,707 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2155326.0, ans=0.0 2023-06-28 22:00:13,331 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.41 vs. limit=15.0 2023-06-28 22:00:47,763 INFO [train.py:996] (0/4) Epoch 12, batch 23800, loss[loss=0.3076, simple_loss=0.3936, pruned_loss=0.1108, over 21455.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2906, pruned_loss=0.06447, over 4275098.54 frames. ], batch size: 471, lr: 2.39e-03, grad_scale: 8.0 2023-06-28 22:01:32,187 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2155566.0, ans=0.1 2023-06-28 22:01:38,729 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2155566.0, ans=0.125 2023-06-28 22:01:45,626 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2155566.0, ans=0.0 2023-06-28 22:02:02,116 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2155626.0, ans=0.1 2023-06-28 22:02:36,535 INFO [train.py:996] (0/4) Epoch 12, batch 23850, loss[loss=0.2268, simple_loss=0.3138, pruned_loss=0.0699, over 21624.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2967, pruned_loss=0.06601, over 4273776.40 frames. ], batch size: 389, lr: 2.39e-03, grad_scale: 8.0 2023-06-28 22:03:01,488 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.378e+02 9.558e+02 1.642e+03 2.659e+03 5.260e+03, threshold=3.284e+03, percent-clipped=38.0 2023-06-28 22:04:19,074 INFO [train.py:996] (0/4) Epoch 12, batch 23900, loss[loss=0.2257, simple_loss=0.3095, pruned_loss=0.07097, over 21734.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.3037, pruned_loss=0.068, over 4275943.33 frames. ], batch size: 118, lr: 2.39e-03, grad_scale: 8.0 2023-06-28 22:05:00,770 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.61 vs. limit=15.0 2023-06-28 22:05:02,210 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.61 vs. limit=22.5 2023-06-28 22:06:00,802 INFO [train.py:996] (0/4) Epoch 12, batch 23950, loss[loss=0.1954, simple_loss=0.2623, pruned_loss=0.06427, over 21739.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2974, pruned_loss=0.0676, over 4269164.47 frames. ], batch size: 112, lr: 2.39e-03, grad_scale: 8.0 2023-06-28 22:06:02,019 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.98 vs. 
limit=22.5 2023-06-28 22:06:25,837 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.975e+02 6.890e+02 9.042e+02 1.238e+03 2.308e+03, threshold=1.808e+03, percent-clipped=0.0 2023-06-28 22:07:04,825 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2156526.0, ans=0.1 2023-06-28 22:07:11,519 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 22:07:28,053 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=2156586.0, ans=0.2 2023-06-28 22:07:48,529 INFO [train.py:996] (0/4) Epoch 12, batch 24000, loss[loss=0.2421, simple_loss=0.3303, pruned_loss=0.07696, over 21748.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2983, pruned_loss=0.07018, over 4258245.21 frames. ], batch size: 124, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:07:48,530 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-28 22:07:58,816 INFO [zipformer.py:1728] (0/4) name=encoder.encoders.4.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([2.5237, 4.0059, 3.7147, 2.6292], device='cuda:0') 2023-06-28 22:08:05,132 INFO [train.py:1028] (0/4) Epoch 12, validation: loss=0.264, simple_loss=0.3553, pruned_loss=0.08634, over 1796401.00 frames. 2023-06-28 22:08:05,133 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23714MB 2023-06-28 22:08:06,531 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.75 vs. limit=10.0 2023-06-28 22:09:20,302 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2156826.0, ans=0.1 2023-06-28 22:09:29,227 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.20 vs. limit=12.0 2023-06-28 22:09:32,369 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.69 vs. limit=10.0 2023-06-28 22:09:36,764 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2156886.0, ans=0.0 2023-06-28 22:09:41,554 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2156886.0, ans=0.125 2023-06-28 22:09:49,049 INFO [train.py:996] (0/4) Epoch 12, batch 24050, loss[loss=0.1814, simple_loss=0.2727, pruned_loss=0.04505, over 21688.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2978, pruned_loss=0.06955, over 4257878.70 frames. 
], batch size: 230, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:10:03,189 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=2156946.0, ans=0.2 2023-06-28 22:10:13,288 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2157006.0, ans=0.0 2023-06-28 22:10:14,181 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.391e+02 8.286e+02 1.353e+03 2.052e+03 4.335e+03, threshold=2.707e+03, percent-clipped=33.0 2023-06-28 22:10:42,656 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=2157066.0, ans=0.95 2023-06-28 22:11:18,840 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2157186.0, ans=0.0 2023-06-28 22:11:22,143 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2157186.0, ans=0.125 2023-06-28 22:11:25,893 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.78 vs. limit=15.0 2023-06-28 22:11:31,533 INFO [train.py:996] (0/4) Epoch 12, batch 24100, loss[loss=0.2198, simple_loss=0.2989, pruned_loss=0.07033, over 21419.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2978, pruned_loss=0.06826, over 4255102.33 frames. ], batch size: 131, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:12:02,426 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2157306.0, ans=0.0 2023-06-28 22:12:30,212 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2157366.0, ans=0.0 2023-06-28 22:12:58,200 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=2157486.0, ans=0.025 2023-06-28 22:13:01,823 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2157486.0, ans=0.125 2023-06-28 22:13:13,309 INFO [train.py:996] (0/4) Epoch 12, batch 24150, loss[loss=0.2078, simple_loss=0.2694, pruned_loss=0.0731, over 21499.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.298, pruned_loss=0.06991, over 4268583.36 frames. ], batch size: 194, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:13:24,041 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=2157546.0, ans=0.125 2023-06-28 22:13:43,011 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.450e+02 8.470e+02 1.133e+03 1.588e+03 3.416e+03, threshold=2.267e+03, percent-clipped=5.0 2023-06-28 22:14:43,647 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=2157786.0, ans=0.125 2023-06-28 22:14:49,611 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.01 vs. limit=15.0 2023-06-28 22:14:56,854 INFO [train.py:996] (0/4) Epoch 12, batch 24200, loss[loss=0.2683, simple_loss=0.3532, pruned_loss=0.09175, over 21624.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.3014, pruned_loss=0.07206, over 4270095.43 frames. 
], batch size: 441, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:15:15,082 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2157846.0, ans=0.125 2023-06-28 22:15:48,429 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=2157966.0, ans=0.125 2023-06-28 22:16:02,022 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2158026.0, ans=0.2 2023-06-28 22:16:11,446 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.65 vs. limit=6.0 2023-06-28 22:16:47,978 INFO [train.py:996] (0/4) Epoch 12, batch 24250, loss[loss=0.1746, simple_loss=0.2657, pruned_loss=0.04171, over 21235.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2987, pruned_loss=0.06724, over 4274978.55 frames. ], batch size: 159, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:16:52,314 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2158146.0, ans=0.125 2023-06-28 22:17:11,303 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.36 vs. limit=15.0 2023-06-28 22:17:17,818 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.733e+02 8.184e+02 1.120e+03 1.541e+03 3.593e+03, threshold=2.240e+03, percent-clipped=10.0 2023-06-28 22:17:20,569 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.96 vs. limit=15.0 2023-06-28 22:17:36,874 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2158266.0, ans=0.125 2023-06-28 22:17:55,190 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2158326.0, ans=0.125 2023-06-28 22:18:03,650 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=2158386.0, ans=0.0 2023-06-28 22:18:05,373 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 22:18:31,292 INFO [train.py:996] (0/4) Epoch 12, batch 24300, loss[loss=0.1584, simple_loss=0.2417, pruned_loss=0.03761, over 21710.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2932, pruned_loss=0.06265, over 4272931.11 frames. ], batch size: 247, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:20:13,787 INFO [train.py:996] (0/4) Epoch 12, batch 24350, loss[loss=0.2512, simple_loss=0.3174, pruned_loss=0.09246, over 21237.00 frames. ], tot_loss[loss=0.207, simple_loss=0.2897, pruned_loss=0.06212, over 4276573.59 frames. 
], batch size: 143, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:20:38,902 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.237e+02 7.403e+02 1.076e+03 1.597e+03 3.002e+03, threshold=2.153e+03, percent-clipped=3.0 2023-06-28 22:20:47,852 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2158806.0, ans=0.125 2023-06-28 22:21:09,859 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=2158926.0, ans=0.04949747468305833 2023-06-28 22:21:19,725 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2158926.0, ans=0.125 2023-06-28 22:21:48,585 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.67 vs. limit=15.0 2023-06-28 22:21:52,114 INFO [train.py:996] (0/4) Epoch 12, batch 24400, loss[loss=0.23, simple_loss=0.3072, pruned_loss=0.07641, over 21708.00 frames. ], tot_loss[loss=0.212, simple_loss=0.2938, pruned_loss=0.06513, over 4276065.41 frames. ], batch size: 351, lr: 2.39e-03, grad_scale: 32.0 2023-06-28 22:22:08,937 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2159046.0, ans=0.1 2023-06-28 22:22:26,035 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=2159106.0, ans=0.125 2023-06-28 22:23:39,915 INFO [train.py:996] (0/4) Epoch 12, batch 24450, loss[loss=0.2076, simple_loss=0.2879, pruned_loss=0.06365, over 21345.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2948, pruned_loss=0.06644, over 4273753.71 frames. ], batch size: 194, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:23:52,396 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=2159346.0, ans=0.025 2023-06-28 22:24:01,303 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.221e+02 9.735e+02 1.433e+03 2.433e+03 5.313e+03, threshold=2.865e+03, percent-clipped=29.0 2023-06-28 22:24:09,556 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.07 vs. limit=15.0 2023-06-28 22:24:13,715 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=2159466.0, ans=0.0 2023-06-28 22:24:44,902 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2159526.0, ans=0.125 2023-06-28 22:25:22,562 INFO [train.py:996] (0/4) Epoch 12, batch 24500, loss[loss=0.1924, simple_loss=0.2722, pruned_loss=0.05629, over 21942.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2954, pruned_loss=0.06727, over 4269395.21 frames. ], batch size: 316, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:25:26,523 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2159646.0, ans=0.1 2023-06-28 22:25:29,706 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2159646.0, ans=0.125 2023-06-28 22:25:54,101 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.67 vs. 
limit=15.0 2023-06-28 22:26:16,339 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2159766.0, ans=0.1 2023-06-28 22:26:16,980 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.53 vs. limit=10.0 2023-06-28 22:26:27,880 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=2159826.0, ans=0.0 2023-06-28 22:27:00,442 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=2159886.0, ans=0.0 2023-06-28 22:27:03,757 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2159946.0, ans=0.125 2023-06-28 22:27:04,765 INFO [train.py:996] (0/4) Epoch 12, batch 24550, loss[loss=0.2654, simple_loss=0.3503, pruned_loss=0.09025, over 21817.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2975, pruned_loss=0.0689, over 4274265.17 frames. ], batch size: 118, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:27:07,146 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=2159946.0, ans=0.07 2023-06-28 22:27:17,326 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=2159946.0, ans=0.2 2023-06-28 22:27:18,410 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-360000.pt 2023-06-28 22:27:28,298 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.692e+02 8.632e+02 1.069e+03 1.677e+03 3.577e+03, threshold=2.139e+03, percent-clipped=6.0 2023-06-28 22:27:33,996 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2160006.0, ans=0.125 2023-06-28 22:27:55,279 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2160066.0, ans=0.0 2023-06-28 22:28:34,477 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2160186.0, ans=0.125 2023-06-28 22:28:43,090 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.14 vs. limit=10.0 2023-06-28 22:28:48,545 INFO [train.py:996] (0/4) Epoch 12, batch 24600, loss[loss=0.1728, simple_loss=0.2414, pruned_loss=0.05211, over 21583.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2944, pruned_loss=0.06869, over 4271612.71 frames. ], batch size: 247, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:29:30,919 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=2160366.0, ans=0.2 2023-06-28 22:30:18,827 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=2160486.0, ans=0.04949747468305833 2023-06-28 22:30:32,030 INFO [train.py:996] (0/4) Epoch 12, batch 24650, loss[loss=0.1998, simple_loss=0.2721, pruned_loss=0.06375, over 21800.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2875, pruned_loss=0.06737, over 4273211.91 frames. 
], batch size: 118, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:30:34,150 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=2160546.0, ans=0.0 2023-06-28 22:30:35,835 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=2160546.0, ans=0.125 2023-06-28 22:30:42,415 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2160546.0, ans=0.1 2023-06-28 22:30:53,467 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.507e+02 9.210e+02 1.420e+03 2.040e+03 4.110e+03, threshold=2.841e+03, percent-clipped=23.0 2023-06-28 22:31:46,774 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.13 vs. limit=12.0 2023-06-28 22:31:51,933 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.87 vs. limit=10.0 2023-06-28 22:32:01,381 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=2160786.0, ans=0.125 2023-06-28 22:32:06,742 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.77 vs. limit=22.5 2023-06-28 22:32:13,589 INFO [train.py:996] (0/4) Epoch 12, batch 24700, loss[loss=0.206, simple_loss=0.2678, pruned_loss=0.0721, over 21492.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.2852, pruned_loss=0.06581, over 4261508.63 frames. ], batch size: 441, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:32:19,213 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.88 vs. limit=22.5 2023-06-28 22:33:31,236 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=2161026.0, ans=0.2 2023-06-28 22:33:37,708 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 22:33:54,736 INFO [train.py:996] (0/4) Epoch 12, batch 24750, loss[loss=0.1973, simple_loss=0.2657, pruned_loss=0.0645, over 22036.00 frames. ], tot_loss[loss=0.2039, simple_loss=0.2804, pruned_loss=0.06372, over 4256414.76 frames. ], batch size: 103, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:34:15,908 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.049e+02 6.504e+02 9.325e+02 1.249e+03 2.794e+03, threshold=1.865e+03, percent-clipped=0.0 2023-06-28 22:34:45,904 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.36 vs. limit=15.0 2023-06-28 22:35:00,326 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.70 vs. limit=22.5 2023-06-28 22:35:07,781 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=2161326.0, ans=0.0 2023-06-28 22:35:35,296 INFO [train.py:996] (0/4) Epoch 12, batch 24800, loss[loss=0.1977, simple_loss=0.2716, pruned_loss=0.06193, over 21472.00 frames. ], tot_loss[loss=0.201, simple_loss=0.2751, pruned_loss=0.06345, over 4268892.48 frames. 
], batch size: 131, lr: 2.39e-03, grad_scale: 32.0 2023-06-28 22:35:35,780 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=2161446.0, ans=0.125 2023-06-28 22:35:53,362 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=2161506.0, ans=0.2 2023-06-28 22:35:58,186 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=2161506.0, ans=0.0 2023-06-28 22:36:02,729 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2161506.0, ans=0.1 2023-06-28 22:36:28,515 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2161566.0, ans=0.125 2023-06-28 22:36:29,009 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=2161566.0, ans=15.0 2023-06-28 22:36:35,889 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.85 vs. limit=15.0 2023-06-28 22:36:51,296 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2161626.0, ans=0.1 2023-06-28 22:37:06,958 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.53 vs. limit=15.0 2023-06-28 22:37:19,227 INFO [train.py:996] (0/4) Epoch 12, batch 24850, loss[loss=0.1969, simple_loss=0.2665, pruned_loss=0.06369, over 21672.00 frames. ], tot_loss[loss=0.2008, simple_loss=0.274, pruned_loss=0.06384, over 4272581.29 frames. ], batch size: 263, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:37:21,581 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2161746.0, ans=0.1 2023-06-28 22:37:26,783 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 22:37:42,810 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.956e+02 8.268e+02 1.225e+03 1.737e+03 3.601e+03, threshold=2.449e+03, percent-clipped=20.0 2023-06-28 22:39:01,938 INFO [train.py:996] (0/4) Epoch 12, batch 24900, loss[loss=0.2303, simple_loss=0.3071, pruned_loss=0.07678, over 21443.00 frames. ], tot_loss[loss=0.2042, simple_loss=0.2781, pruned_loss=0.06512, over 4274215.10 frames. 
], batch size: 131, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:39:07,915 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2162046.0, ans=0.1 2023-06-28 22:39:12,889 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2162046.0, ans=0.1 2023-06-28 22:39:42,840 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2162106.0, ans=0.0 2023-06-28 22:40:07,824 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=2162166.0, ans=0.035 2023-06-28 22:40:09,484 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2162166.0, ans=0.1 2023-06-28 22:40:32,070 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2162286.0, ans=0.0 2023-06-28 22:40:40,580 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2162286.0, ans=0.125 2023-06-28 22:40:46,449 INFO [train.py:996] (0/4) Epoch 12, batch 24950, loss[loss=0.268, simple_loss=0.3375, pruned_loss=0.09924, over 21596.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2846, pruned_loss=0.06817, over 4269322.63 frames. ], batch size: 389, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:41:00,511 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2162346.0, ans=0.125 2023-06-28 22:41:20,552 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.276e+02 8.663e+02 1.354e+03 1.983e+03 3.739e+03, threshold=2.709e+03, percent-clipped=10.0 2023-06-28 22:41:41,924 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=2162466.0, ans=0.125 2023-06-28 22:41:43,466 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=2162466.0, ans=0.0 2023-06-28 22:41:46,638 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=2162466.0, ans=0.95 2023-06-28 22:41:50,314 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2162466.0, ans=0.125 2023-06-28 22:41:51,834 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=2162526.0, ans=0.07 2023-06-28 22:42:31,485 INFO [train.py:996] (0/4) Epoch 12, batch 25000, loss[loss=0.2363, simple_loss=0.3301, pruned_loss=0.07128, over 21838.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2915, pruned_loss=0.06994, over 4275362.26 frames. ], batch size: 124, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:42:34,207 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.12 vs. limit=6.0 2023-06-28 22:43:51,591 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=2162886.0, ans=0.0 2023-06-28 22:44:12,539 INFO [train.py:996] (0/4) Epoch 12, batch 25050, loss[loss=0.1705, simple_loss=0.2448, pruned_loss=0.04814, over 21579.00 frames. 
], tot_loss[loss=0.2107, simple_loss=0.2846, pruned_loss=0.06839, over 4270207.40 frames. ], batch size: 263, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:44:49,743 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.916e+02 6.443e+02 9.220e+02 1.309e+03 4.556e+03, threshold=1.844e+03, percent-clipped=4.0 2023-06-28 22:44:58,864 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.56 vs. limit=15.0 2023-06-28 22:44:59,992 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=2163066.0, ans=0.125 2023-06-28 22:45:03,122 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff3.min_abs, batch_count=2163066.0, ans=0.2 2023-06-28 22:45:10,433 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.98 vs. limit=15.0 2023-06-28 22:45:43,226 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=2163186.0, ans=0.125 2023-06-28 22:45:54,125 INFO [train.py:996] (0/4) Epoch 12, batch 25100, loss[loss=0.2167, simple_loss=0.3077, pruned_loss=0.06284, over 21665.00 frames. ], tot_loss[loss=0.2061, simple_loss=0.2783, pruned_loss=0.06689, over 4277685.29 frames. ], batch size: 332, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:46:00,381 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.77 vs. limit=22.5 2023-06-28 22:46:02,796 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=2163246.0, ans=0.0 2023-06-28 22:46:15,025 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=2163306.0, ans=0.04949747468305833 2023-06-28 22:46:52,271 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.67 vs. limit=10.0 2023-06-28 22:46:56,767 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=2163426.0, ans=0.125 2023-06-28 22:46:56,861 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=2163426.0, ans=0.025 2023-06-28 22:47:17,958 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2163486.0, ans=0.0 2023-06-28 22:47:30,139 INFO [train.py:996] (0/4) Epoch 12, batch 25150, loss[loss=0.2237, simple_loss=0.3205, pruned_loss=0.06349, over 21661.00 frames. ], tot_loss[loss=0.2066, simple_loss=0.2827, pruned_loss=0.06532, over 4277942.83 frames. 
], batch size: 441, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:47:40,887 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=2163546.0, ans=0.5 2023-06-28 22:47:54,085 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2163606.0, ans=0.125 2023-06-28 22:48:07,816 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.073e+02 7.241e+02 9.101e+02 1.469e+03 3.331e+03, threshold=1.820e+03, percent-clipped=11.0 2023-06-28 22:48:08,545 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2163606.0, ans=0.125 2023-06-28 22:48:10,280 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2163606.0, ans=0.0 2023-06-28 22:49:12,470 INFO [train.py:996] (0/4) Epoch 12, batch 25200, loss[loss=0.2113, simple_loss=0.3138, pruned_loss=0.05445, over 21612.00 frames. ], tot_loss[loss=0.2054, simple_loss=0.2829, pruned_loss=0.06396, over 4267668.01 frames. ], batch size: 389, lr: 2.39e-03, grad_scale: 32.0 2023-06-28 22:49:30,698 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=2163846.0, ans=0.1 2023-06-28 22:49:37,818 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2163906.0, ans=0.1 2023-06-28 22:49:43,917 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2163906.0, ans=0.0 2023-06-28 22:49:43,963 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 22:50:02,408 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=2163966.0, ans=0.125 2023-06-28 22:50:54,702 INFO [train.py:996] (0/4) Epoch 12, batch 25250, loss[loss=0.1959, simple_loss=0.2531, pruned_loss=0.06934, over 20233.00 frames. ], tot_loss[loss=0.2022, simple_loss=0.2801, pruned_loss=0.06218, over 4257638.49 frames. ], batch size: 703, lr: 2.38e-03, grad_scale: 32.0 2023-06-28 22:51:33,359 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.757e+02 8.180e+02 1.142e+03 1.720e+03 2.915e+03, threshold=2.285e+03, percent-clipped=21.0 2023-06-28 22:52:04,883 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=2164326.0, ans=0.125 2023-06-28 22:52:08,473 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=2164326.0, ans=0.0 2023-06-28 22:52:13,310 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2164326.0, ans=0.0 2023-06-28 22:52:25,424 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=2164386.0, ans=0.0 2023-06-28 22:52:31,950 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=2164386.0, ans=10.0 2023-06-28 22:52:36,301 INFO [train.py:996] (0/4) Epoch 12, batch 25300, loss[loss=0.1621, simple_loss=0.2463, pruned_loss=0.03897, over 21639.00 frames. ], tot_loss[loss=0.1998, simple_loss=0.2767, pruned_loss=0.06142, over 4248144.16 frames. 
], batch size: 263, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 22:53:18,323 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.21 vs. limit=6.0 2023-06-28 22:54:03,446 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=13.42 vs. limit=15.0 2023-06-28 22:54:12,818 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2164686.0, ans=0.1 2023-06-28 22:54:22,193 INFO [train.py:996] (0/4) Epoch 12, batch 25350, loss[loss=0.2197, simple_loss=0.3068, pruned_loss=0.06633, over 21450.00 frames. ], tot_loss[loss=0.201, simple_loss=0.2795, pruned_loss=0.06119, over 4237097.23 frames. ], batch size: 507, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 22:54:29,581 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2164746.0, ans=0.125 2023-06-28 22:54:55,962 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.352e+02 8.129e+02 1.301e+03 1.964e+03 4.138e+03, threshold=2.601e+03, percent-clipped=21.0 2023-06-28 22:54:58,617 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.56 vs. limit=15.0 2023-06-28 22:54:59,793 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 22:54:59,837 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2164806.0, ans=0.125 2023-06-28 22:55:02,929 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=2164866.0, ans=0.125 2023-06-28 22:55:45,669 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.46 vs. limit=12.0 2023-06-28 22:55:56,922 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.27 vs. limit=15.0 2023-06-28 22:55:57,308 INFO [train.py:996] (0/4) Epoch 12, batch 25400, loss[loss=0.1862, simple_loss=0.2597, pruned_loss=0.05637, over 21349.00 frames. ], tot_loss[loss=0.1973, simple_loss=0.2745, pruned_loss=0.06001, over 4241653.99 frames. ], batch size: 131, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 22:55:57,848 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2165046.0, ans=0.125 2023-06-28 22:56:09,219 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2165046.0, ans=0.0 2023-06-28 22:56:41,801 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.30 vs. limit=12.0 2023-06-28 22:57:11,034 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.65 vs. 
limit=12.0 2023-06-28 22:57:34,613 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=2165286.0, ans=0.0 2023-06-28 22:57:34,620 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=2165286.0, ans=0.2 2023-06-28 22:57:37,576 INFO [train.py:996] (0/4) Epoch 12, batch 25450, loss[loss=0.2103, simple_loss=0.2848, pruned_loss=0.06795, over 21470.00 frames. ], tot_loss[loss=0.2005, simple_loss=0.2763, pruned_loss=0.06239, over 4251343.56 frames. ], batch size: 548, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 22:57:49,788 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2165346.0, ans=0.125 2023-06-28 22:58:11,895 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.754e+02 9.465e+02 1.393e+03 2.029e+03 3.933e+03, threshold=2.786e+03, percent-clipped=12.0 2023-06-28 22:58:40,762 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 22:59:25,649 INFO [train.py:996] (0/4) Epoch 12, batch 25500, loss[loss=0.2042, simple_loss=0.2951, pruned_loss=0.0566, over 21792.00 frames. ], tot_loss[loss=0.1983, simple_loss=0.277, pruned_loss=0.05977, over 4253475.63 frames. ], batch size: 351, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 22:59:31,729 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.87 vs. limit=22.5 2023-06-28 22:59:42,215 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2165646.0, ans=0.0 2023-06-28 22:59:52,170 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2165706.0, ans=0.1 2023-06-28 22:59:55,410 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2165706.0, ans=0.0 2023-06-28 23:00:03,759 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2165706.0, ans=0.125 2023-06-28 23:00:21,730 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2165766.0, ans=0.0 2023-06-28 23:01:09,247 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=2165886.0, ans=0.0 2023-06-28 23:01:12,023 INFO [train.py:996] (0/4) Epoch 12, batch 25550, loss[loss=0.2112, simple_loss=0.3138, pruned_loss=0.05433, over 21737.00 frames. ], tot_loss[loss=0.2034, simple_loss=0.2849, pruned_loss=0.06091, over 4265633.50 frames. 
], batch size: 351, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:01:14,338 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2165946.0, ans=0.125 2023-06-28 23:01:46,810 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.170e+02 8.424e+02 1.256e+03 1.965e+03 3.448e+03, threshold=2.512e+03, percent-clipped=4.0 2023-06-28 23:01:50,617 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 23:02:03,828 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2166066.0, ans=0.125 2023-06-28 23:02:15,236 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2166126.0, ans=0.0 2023-06-28 23:02:17,019 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=2166126.0, ans=0.125 2023-06-28 23:02:47,919 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 23:02:50,020 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.32 vs. limit=15.0 2023-06-28 23:02:58,672 INFO [train.py:996] (0/4) Epoch 12, batch 25600, loss[loss=0.2551, simple_loss=0.3285, pruned_loss=0.09087, over 21779.00 frames. ], tot_loss[loss=0.208, simple_loss=0.2899, pruned_loss=0.06309, over 4269774.96 frames. ], batch size: 441, lr: 2.38e-03, grad_scale: 32.0 2023-06-28 23:03:07,489 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2166246.0, ans=0.0 2023-06-28 23:03:37,193 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2166366.0, ans=0.0 2023-06-28 23:04:39,576 INFO [train.py:996] (0/4) Epoch 12, batch 25650, loss[loss=0.2093, simple_loss=0.2675, pruned_loss=0.07561, over 21571.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2899, pruned_loss=0.06482, over 4268375.82 frames. ], batch size: 415, lr: 2.38e-03, grad_scale: 32.0 2023-06-28 23:04:45,111 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2166546.0, ans=0.125 2023-06-28 23:04:48,419 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2166546.0, ans=0.125 2023-06-28 23:05:10,125 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.401e+02 8.404e+02 1.162e+03 1.787e+03 4.210e+03, threshold=2.325e+03, percent-clipped=7.0 2023-06-28 23:05:11,505 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.79 vs. 
limit=22.5 2023-06-28 23:05:20,462 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=2166666.0, ans=0.125 2023-06-28 23:05:32,148 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=2166726.0, ans=0.07 2023-06-28 23:05:59,245 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2166786.0, ans=0.125 2023-06-28 23:06:05,732 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=2166786.0, ans=0.125 2023-06-28 23:06:19,773 INFO [train.py:996] (0/4) Epoch 12, batch 25700, loss[loss=0.206, simple_loss=0.2968, pruned_loss=0.05757, over 21459.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.288, pruned_loss=0.06528, over 4271208.83 frames. ], batch size: 211, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:06:24,331 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.50 vs. limit=15.0 2023-06-28 23:06:26,293 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.08 vs. limit=6.0 2023-06-28 23:06:51,609 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2166906.0, ans=0.125 2023-06-28 23:07:08,494 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=13.55 vs. limit=15.0 2023-06-28 23:07:38,034 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2167026.0, ans=0.1 2023-06-28 23:08:07,870 INFO [train.py:996] (0/4) Epoch 12, batch 25750, loss[loss=0.1807, simple_loss=0.2423, pruned_loss=0.05952, over 21050.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2915, pruned_loss=0.06779, over 4271814.94 frames. ], batch size: 608, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:08:16,050 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.78 vs. limit=15.0 2023-06-28 23:08:18,938 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=2167146.0, ans=0.125 2023-06-28 23:08:39,821 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.094e+02 7.467e+02 1.127e+03 1.693e+03 5.779e+03, threshold=2.254e+03, percent-clipped=13.0 2023-06-28 23:08:55,933 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=15.76 vs. limit=15.0 2023-06-28 23:09:21,922 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2167326.0, ans=0.125 2023-06-28 23:09:35,193 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=2167386.0, ans=0.0 2023-06-28 23:09:36,805 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2167386.0, ans=0.0 2023-06-28 23:09:51,797 INFO [train.py:996] (0/4) Epoch 12, batch 25800, loss[loss=0.2294, simple_loss=0.3132, pruned_loss=0.07277, over 21677.00 frames. 
], tot_loss[loss=0.2248, simple_loss=0.304, pruned_loss=0.07283, over 4268506.49 frames. ], batch size: 351, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:10:00,091 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.59 vs. limit=6.0 2023-06-28 23:10:00,931 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2167446.0, ans=0.1 2023-06-28 23:10:28,408 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2167506.0, ans=0.0 2023-06-28 23:11:08,275 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.40 vs. limit=10.0 2023-06-28 23:11:33,286 INFO [train.py:996] (0/4) Epoch 12, batch 25850, loss[loss=0.1989, simple_loss=0.2764, pruned_loss=0.06071, over 21795.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.3046, pruned_loss=0.07179, over 4269546.86 frames. ], batch size: 298, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:12:08,743 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.172e+02 7.676e+02 1.093e+03 1.751e+03 3.507e+03, threshold=2.187e+03, percent-clipped=11.0 2023-06-28 23:12:42,084 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2167926.0, ans=0.125 2023-06-28 23:12:52,275 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2167926.0, ans=0.1 2023-06-28 23:12:55,447 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2167926.0, ans=0.1 2023-06-28 23:12:59,109 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2167986.0, ans=0.125 2023-06-28 23:13:23,852 INFO [train.py:996] (0/4) Epoch 12, batch 25900, loss[loss=0.2328, simple_loss=0.3272, pruned_loss=0.06922, over 21820.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.307, pruned_loss=0.07209, over 4276499.27 frames. ], batch size: 351, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:13:43,827 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2168046.0, ans=0.125 2023-06-28 23:13:43,868 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2168046.0, ans=0.125 2023-06-28 23:14:18,809 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2168166.0, ans=0.2 2023-06-28 23:14:42,107 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2168226.0, ans=0.125 2023-06-28 23:14:51,843 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2168286.0, ans=0.125 2023-06-28 23:14:53,815 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.50 vs. limit=15.0 2023-06-28 23:14:57,156 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.75 vs. 
limit=15.0 2023-06-28 23:15:07,744 INFO [train.py:996] (0/4) Epoch 12, batch 25950, loss[loss=0.2144, simple_loss=0.299, pruned_loss=0.06492, over 21594.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3102, pruned_loss=0.07351, over 4275784.47 frames. ], batch size: 263, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:15:24,928 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 23:15:42,776 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=2168406.0, ans=0.125 2023-06-28 23:15:43,678 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.181e+02 7.724e+02 1.093e+03 1.792e+03 4.212e+03, threshold=2.186e+03, percent-clipped=19.0 2023-06-28 23:15:50,795 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2168466.0, ans=0.1 2023-06-28 23:16:45,721 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.68 vs. limit=22.5 2023-06-28 23:16:54,129 INFO [train.py:996] (0/4) Epoch 12, batch 26000, loss[loss=0.2435, simple_loss=0.3294, pruned_loss=0.0788, over 21480.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.3095, pruned_loss=0.07184, over 4274295.68 frames. ], batch size: 131, lr: 2.38e-03, grad_scale: 32.0 2023-06-28 23:16:59,815 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=2168646.0, ans=0.2 2023-06-28 23:17:36,363 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2168766.0, ans=0.125 2023-06-28 23:17:44,536 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2168766.0, ans=0.1 2023-06-28 23:17:46,423 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=2168766.0, ans=0.125 2023-06-28 23:18:36,003 INFO [train.py:996] (0/4) Epoch 12, batch 26050, loss[loss=0.2304, simple_loss=0.3043, pruned_loss=0.07827, over 21868.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.3098, pruned_loss=0.07235, over 4275548.30 frames. ], batch size: 118, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:19:08,203 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.201e+02 7.488e+02 9.436e+02 1.230e+03 3.511e+03, threshold=1.887e+03, percent-clipped=1.0 2023-06-28 23:19:51,649 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 23:19:55,264 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.55 vs. limit=15.0 2023-06-28 23:20:16,748 INFO [train.py:996] (0/4) Epoch 12, batch 26100, loss[loss=0.1933, simple_loss=0.2637, pruned_loss=0.06144, over 20963.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.3044, pruned_loss=0.07249, over 4279227.80 frames. 
], batch size: 607, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:20:50,385 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=2169306.0, ans=0.2 2023-06-28 23:20:56,749 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=2169366.0, ans=0.0 2023-06-28 23:21:03,692 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2169366.0, ans=0.1 2023-06-28 23:21:22,356 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2169426.0, ans=0.0 2023-06-28 23:21:29,797 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.67 vs. limit=15.0 2023-06-28 23:21:39,164 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=2169486.0, ans=0.125 2023-06-28 23:21:49,205 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=2169486.0, ans=6.0 2023-06-28 23:22:03,444 INFO [train.py:996] (0/4) Epoch 12, batch 26150, loss[loss=0.2039, simple_loss=0.2765, pruned_loss=0.06568, over 21708.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.3013, pruned_loss=0.07254, over 4289054.00 frames. ], batch size: 230, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:22:08,943 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=2169546.0, ans=0.0 2023-06-28 23:22:31,480 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.410e+02 8.718e+02 1.214e+03 1.632e+03 3.208e+03, threshold=2.428e+03, percent-clipped=15.0 2023-06-28 23:22:37,422 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2169666.0, ans=0.125 2023-06-28 23:23:01,930 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.87 vs. limit=15.0 2023-06-28 23:23:16,667 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.77 vs. limit=15.0 2023-06-28 23:23:44,679 INFO [train.py:996] (0/4) Epoch 12, batch 26200, loss[loss=0.2336, simple_loss=0.3391, pruned_loss=0.06404, over 21860.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.3022, pruned_loss=0.07103, over 4294936.86 frames. ], batch size: 371, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:24:06,784 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2169906.0, ans=0.125 2023-06-28 23:24:16,290 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2169966.0, ans=0.1 2023-06-28 23:24:25,722 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2169966.0, ans=0.0 2023-06-28 23:24:53,478 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2170026.0, ans=0.0 2023-06-28 23:25:10,812 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.69 vs. 
limit=15.0 2023-06-28 23:25:16,568 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=2170086.0, ans=10.0 2023-06-28 23:25:25,901 INFO [train.py:996] (0/4) Epoch 12, batch 26250, loss[loss=0.2035, simple_loss=0.3201, pruned_loss=0.04347, over 20722.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.3047, pruned_loss=0.07002, over 4293902.76 frames. ], batch size: 608, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:25:52,909 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.372e+02 9.468e+02 1.366e+03 2.102e+03 4.403e+03, threshold=2.732e+03, percent-clipped=13.0 2023-06-28 23:26:18,507 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.68 vs. limit=15.0 2023-06-28 23:27:01,314 INFO [train.py:996] (0/4) Epoch 12, batch 26300, loss[loss=0.2086, simple_loss=0.2732, pruned_loss=0.07202, over 21556.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.3018, pruned_loss=0.07002, over 4298620.37 frames. ], batch size: 548, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:27:19,963 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2170506.0, ans=0.125 2023-06-28 23:27:23,336 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2170506.0, ans=0.0 2023-06-28 23:27:38,966 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2170506.0, ans=0.0 2023-06-28 23:27:42,179 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 23:27:57,341 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.61 vs. limit=10.0 2023-06-28 23:27:59,144 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.67 vs. limit=15.0 2023-06-28 23:28:42,473 INFO [train.py:996] (0/4) Epoch 12, batch 26350, loss[loss=0.2354, simple_loss=0.3063, pruned_loss=0.08227, over 21451.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.3009, pruned_loss=0.07109, over 4292452.22 frames. ], batch size: 211, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:28:44,683 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2170746.0, ans=0.125 2023-06-28 23:29:19,232 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.451e+02 7.926e+02 1.139e+03 2.111e+03 4.700e+03, threshold=2.277e+03, percent-clipped=11.0 2023-06-28 23:29:51,104 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=2170926.0, ans=0.2 2023-06-28 23:30:00,933 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.21 vs. limit=15.0 2023-06-28 23:30:05,800 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2170986.0, ans=0.1 2023-06-28 23:30:23,050 INFO [train.py:996] (0/4) Epoch 12, batch 26400, loss[loss=0.1839, simple_loss=0.2426, pruned_loss=0.06259, over 21256.00 frames. 
], tot_loss[loss=0.2192, simple_loss=0.2959, pruned_loss=0.07123, over 4292656.09 frames. ], batch size: 549, lr: 2.38e-03, grad_scale: 32.0 2023-06-28 23:30:38,813 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2171046.0, ans=0.125 2023-06-28 23:30:48,745 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=2171106.0, ans=0.95 2023-06-28 23:31:39,296 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=2171226.0, ans=0.125 2023-06-28 23:32:16,581 INFO [train.py:996] (0/4) Epoch 12, batch 26450, loss[loss=0.236, simple_loss=0.3245, pruned_loss=0.07378, over 21755.00 frames. ], tot_loss[loss=0.221, simple_loss=0.2981, pruned_loss=0.07192, over 4289905.51 frames. ], batch size: 282, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:32:51,626 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.901e+02 9.721e+02 1.441e+03 2.127e+03 5.226e+03, threshold=2.882e+03, percent-clipped=23.0 2023-06-28 23:32:57,315 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=2171466.0, ans=0.0 2023-06-28 23:33:02,611 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2171466.0, ans=0.125 2023-06-28 23:33:59,929 INFO [train.py:996] (0/4) Epoch 12, batch 26500, loss[loss=0.1671, simple_loss=0.2322, pruned_loss=0.05096, over 21325.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2997, pruned_loss=0.07042, over 4288798.27 frames. ], batch size: 131, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:34:09,133 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=2171646.0, ans=0.125 2023-06-28 23:34:44,232 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2171766.0, ans=0.125 2023-06-28 23:35:30,385 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2171886.0, ans=0.125 2023-06-28 23:35:47,909 INFO [train.py:996] (0/4) Epoch 12, batch 26550, loss[loss=0.1868, simple_loss=0.3022, pruned_loss=0.03564, over 20824.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2987, pruned_loss=0.06837, over 4276709.84 frames. ], batch size: 608, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:35:55,381 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2171946.0, ans=0.0 2023-06-28 23:36:23,129 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.969e+02 7.971e+02 1.184e+03 2.245e+03 4.419e+03, threshold=2.369e+03, percent-clipped=15.0 2023-06-28 23:37:20,967 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2172186.0, ans=0.125 2023-06-28 23:37:22,894 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=2172186.0, ans=0.04949747468305833 2023-06-28 23:37:28,574 INFO [train.py:996] (0/4) Epoch 12, batch 26600, loss[loss=0.1845, simple_loss=0.2712, pruned_loss=0.04891, over 21603.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.297, pruned_loss=0.0661, over 4270175.30 frames. 
], batch size: 263, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:37:50,218 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=2172306.0, ans=0.05 2023-06-28 23:38:34,441 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.46 vs. limit=15.0 2023-06-28 23:38:54,017 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=2172486.0, ans=0.125 2023-06-28 23:39:00,810 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=2172486.0, ans=0.2 2023-06-28 23:39:08,237 INFO [train.py:996] (0/4) Epoch 12, batch 26650, loss[loss=0.1687, simple_loss=0.2589, pruned_loss=0.03931, over 21578.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.29, pruned_loss=0.06427, over 4258547.55 frames. ], batch size: 442, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:39:14,952 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=2172546.0, ans=10.0 2023-06-28 23:39:21,540 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2172546.0, ans=0.125 2023-06-28 23:39:29,601 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=2172606.0, ans=0.2 2023-06-28 23:39:44,513 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=2172606.0, ans=6.0 2023-06-28 23:39:46,394 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.698e+02 6.768e+02 8.885e+02 1.234e+03 3.430e+03, threshold=1.777e+03, percent-clipped=1.0 2023-06-28 23:40:12,418 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2172726.0, ans=0.125 2023-06-28 23:40:15,431 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2172726.0, ans=0.1 2023-06-28 23:40:18,988 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2172726.0, ans=0.1 2023-06-28 23:40:50,891 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2172846.0, ans=0.125 2023-06-28 23:40:52,087 INFO [train.py:996] (0/4) Epoch 12, batch 26700, loss[loss=0.1732, simple_loss=0.2382, pruned_loss=0.05412, over 21245.00 frames. ], tot_loss[loss=0.2039, simple_loss=0.2836, pruned_loss=0.06208, over 4264103.42 frames. ], batch size: 176, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:42:06,134 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2173026.0, ans=0.125 2023-06-28 23:42:07,845 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2173026.0, ans=0.125 2023-06-28 23:42:26,447 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.00 vs. limit=15.0 2023-06-28 23:42:33,657 INFO [train.py:996] (0/4) Epoch 12, batch 26750, loss[loss=0.2044, simple_loss=0.2901, pruned_loss=0.05937, over 21421.00 frames. 
], tot_loss[loss=0.2027, simple_loss=0.2835, pruned_loss=0.06093, over 4265865.24 frames. ], batch size: 211, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:43:12,973 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.416e+02 8.010e+02 1.094e+03 1.630e+03 3.819e+03, threshold=2.187e+03, percent-clipped=19.0 2023-06-28 23:43:31,954 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=2173266.0, ans=0.2 2023-06-28 23:44:04,762 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 23:44:20,462 INFO [train.py:996] (0/4) Epoch 12, batch 26800, loss[loss=0.2213, simple_loss=0.2969, pruned_loss=0.07286, over 20646.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2899, pruned_loss=0.06448, over 4262242.22 frames. ], batch size: 607, lr: 2.38e-03, grad_scale: 32.0 2023-06-28 23:44:48,759 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=2173506.0, ans=0.0 2023-06-28 23:46:05,358 INFO [train.py:996] (0/4) Epoch 12, batch 26850, loss[loss=0.211, simple_loss=0.2706, pruned_loss=0.07566, over 21124.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.2914, pruned_loss=0.06661, over 4263065.58 frames. ], batch size: 159, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:46:18,894 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2173746.0, ans=0.125 2023-06-28 23:46:23,783 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=2173806.0, ans=0.125 2023-06-28 23:46:28,308 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2173806.0, ans=0.125 2023-06-28 23:46:40,818 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.776e+02 8.023e+02 1.160e+03 1.579e+03 4.505e+03, threshold=2.321e+03, percent-clipped=8.0 2023-06-28 23:47:13,227 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2173926.0, ans=0.125 2023-06-28 23:47:33,186 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.17 vs. limit=15.0 2023-06-28 23:47:40,062 INFO [train.py:996] (0/4) Epoch 12, batch 26900, loss[loss=0.1935, simple_loss=0.2648, pruned_loss=0.06115, over 21341.00 frames. ], tot_loss[loss=0.2074, simple_loss=0.2831, pruned_loss=0.06583, over 4264826.87 frames. 
], batch size: 131, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:48:45,908 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=2174226.0, ans=0.035 2023-06-28 23:49:16,273 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2174286.0, ans=0.0 2023-06-28 23:49:16,296 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2174286.0, ans=0.1 2023-06-28 23:49:16,308 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2174286.0, ans=0.125 2023-06-28 23:49:19,001 INFO [train.py:996] (0/4) Epoch 12, batch 26950, loss[loss=0.316, simple_loss=0.3814, pruned_loss=0.1253, over 21477.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2833, pruned_loss=0.06651, over 4265258.99 frames. ], batch size: 508, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:49:49,105 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2174406.0, ans=0.125 2023-06-28 23:49:54,865 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.876e+02 6.986e+02 1.003e+03 1.529e+03 4.492e+03, threshold=2.006e+03, percent-clipped=11.0 2023-06-28 23:50:18,207 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=2174466.0, ans=0.125 2023-06-28 23:50:31,990 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=2174526.0, ans=0.125 2023-06-28 23:50:50,044 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2174586.0, ans=0.125 2023-06-28 23:51:06,159 INFO [train.py:996] (0/4) Epoch 12, batch 27000, loss[loss=0.1913, simple_loss=0.2982, pruned_loss=0.04215, over 21177.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.2845, pruned_loss=0.06464, over 4270078.28 frames. ], batch size: 548, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:51:06,161 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-28 23:51:22,021 INFO [train.py:1028] (0/4) Epoch 12, validation: loss=0.2512, simple_loss=0.3387, pruned_loss=0.08188, over 1796401.00 frames. 2023-06-28 23:51:22,022 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23714MB 2023-06-28 23:51:56,465 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.63 vs. limit=10.0 2023-06-28 23:51:58,470 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.96 vs. limit=22.5 2023-06-28 23:52:08,542 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.95 vs. limit=15.0 2023-06-28 23:52:31,805 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=2174826.0, ans=0.125 2023-06-28 23:52:54,861 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=2174886.0, ans=0.125 2023-06-28 23:53:03,873 INFO [train.py:996] (0/4) Epoch 12, batch 27050, loss[loss=0.2053, simple_loss=0.2922, pruned_loss=0.05916, over 21803.00 frames. 
], tot_loss[loss=0.2055, simple_loss=0.2868, pruned_loss=0.06212, over 4274297.83 frames. ], batch size: 298, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:53:07,631 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2174946.0, ans=0.0 2023-06-28 23:53:44,983 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.952e+02 1.010e+03 1.463e+03 2.409e+03 4.686e+03, threshold=2.925e+03, percent-clipped=39.0 2023-06-28 23:53:47,733 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.56 vs. limit=10.0 2023-06-28 23:53:49,421 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.19 vs. limit=15.0 2023-06-28 23:53:52,764 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.46 vs. limit=22.5 2023-06-28 23:54:19,875 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2175126.0, ans=0.1 2023-06-28 23:54:45,813 INFO [train.py:996] (0/4) Epoch 12, batch 27100, loss[loss=0.2164, simple_loss=0.28, pruned_loss=0.07644, over 21606.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.2873, pruned_loss=0.06304, over 4283387.41 frames. ], batch size: 548, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:55:27,641 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=2175366.0, ans=0.0 2023-06-28 23:55:29,256 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=2175366.0, ans=0.0 2023-06-28 23:55:44,245 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2175366.0, ans=0.0 2023-06-28 23:56:14,468 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2175486.0, ans=0.125 2023-06-28 23:56:34,066 INFO [train.py:996] (0/4) Epoch 12, batch 27150, loss[loss=0.3339, simple_loss=0.4119, pruned_loss=0.1279, over 21514.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.3, pruned_loss=0.06646, over 4285114.53 frames. ], batch size: 508, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:56:50,007 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2175606.0, ans=0.1 2023-06-28 23:56:57,849 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=2175606.0, ans=0.125 2023-06-28 23:57:14,486 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 6.191e+02 8.496e+02 1.171e+03 1.771e+03 3.313e+03, threshold=2.341e+03, percent-clipped=5.0 2023-06-28 23:57:22,345 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.33 vs. 
limit=15.0 2023-06-28 23:57:34,745 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=2175726.0, ans=0.0 2023-06-28 23:57:44,555 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=2175726.0, ans=0.0 2023-06-28 23:58:07,734 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2175786.0, ans=0.125 2023-06-28 23:58:15,369 INFO [train.py:996] (0/4) Epoch 12, batch 27200, loss[loss=0.2514, simple_loss=0.3417, pruned_loss=0.08058, over 21327.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.3084, pruned_loss=0.0685, over 4283575.94 frames. ], batch size: 548, lr: 2.38e-03, grad_scale: 32.0 2023-06-28 23:58:45,997 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.72 vs. limit=15.0 2023-06-28 23:58:46,838 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=2175906.0, ans=0.125 2023-06-28 23:59:13,069 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=2175966.0, ans=0.0 2023-06-28 23:59:48,356 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.09 vs. limit=15.0 2023-06-29 00:00:01,696 INFO [train.py:996] (0/4) Epoch 12, batch 27250, loss[loss=0.2437, simple_loss=0.3158, pruned_loss=0.08579, over 21947.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.31, pruned_loss=0.07157, over 4279086.41 frames. ], batch size: 372, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:00:03,808 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2176146.0, ans=0.1 2023-06-29 00:00:03,907 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2176146.0, ans=0.125 2023-06-29 00:00:25,948 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2176206.0, ans=0.1 2023-06-29 00:00:45,294 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.531e+02 9.436e+02 1.424e+03 2.260e+03 4.305e+03, threshold=2.849e+03, percent-clipped=22.0 2023-06-29 00:01:03,351 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.31 vs. limit=15.0 2023-06-29 00:01:09,496 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2176326.0, ans=0.125 2023-06-29 00:01:24,202 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-29 00:01:30,838 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2176386.0, ans=0.0 2023-06-29 00:01:49,813 INFO [train.py:996] (0/4) Epoch 12, batch 27300, loss[loss=0.2303, simple_loss=0.3133, pruned_loss=0.07368, over 21233.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3108, pruned_loss=0.07237, over 4283252.00 frames. 
], batch size: 159, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:02:06,910 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2176446.0, ans=0.125 2023-06-29 00:02:44,484 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.29 vs. limit=6.0 2023-06-29 00:02:55,009 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=2176626.0, ans=0.125 2023-06-29 00:03:23,910 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=2176686.0, ans=0.125 2023-06-29 00:03:25,784 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2176686.0, ans=0.125 2023-06-29 00:03:31,714 INFO [train.py:996] (0/4) Epoch 12, batch 27350, loss[loss=0.2103, simple_loss=0.3002, pruned_loss=0.06018, over 21814.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3128, pruned_loss=0.07291, over 4288830.37 frames. ], batch size: 351, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:03:43,365 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2176746.0, ans=0.125 2023-06-29 00:03:49,809 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2176746.0, ans=0.0 2023-06-29 00:03:54,768 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2176806.0, ans=0.0 2023-06-29 00:04:13,494 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.353e+02 7.469e+02 1.032e+03 1.512e+03 4.171e+03, threshold=2.065e+03, percent-clipped=4.0 2023-06-29 00:04:44,017 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2176926.0, ans=0.125 2023-06-29 00:05:01,657 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=2176986.0, ans=0.04949747468305833 2023-06-29 00:05:04,713 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=2176986.0, ans=0.125 2023-06-29 00:05:10,458 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.16 vs. limit=22.5 2023-06-29 00:05:15,161 INFO [train.py:996] (0/4) Epoch 12, batch 27400, loss[loss=0.2079, simple_loss=0.2704, pruned_loss=0.07267, over 21258.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3076, pruned_loss=0.07176, over 4281756.34 frames. 
], batch size: 159, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:05:48,913 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2177106.0, ans=0.125 2023-06-29 00:05:59,943 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2177166.0, ans=0.1 2023-06-29 00:06:14,075 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2177166.0, ans=0.1 2023-06-29 00:06:54,557 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=2177346.0, ans=0.2 2023-06-29 00:06:55,554 INFO [train.py:996] (0/4) Epoch 12, batch 27450, loss[loss=0.2147, simple_loss=0.2854, pruned_loss=0.07196, over 21071.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.3005, pruned_loss=0.0702, over 4281074.32 frames. ], batch size: 143, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:06:56,453 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2177346.0, ans=0.2 2023-06-29 00:07:09,308 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2177346.0, ans=0.0 2023-06-29 00:07:32,478 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.358e+02 7.847e+02 1.147e+03 1.584e+03 3.380e+03, threshold=2.294e+03, percent-clipped=11.0 2023-06-29 00:08:34,578 INFO [train.py:996] (0/4) Epoch 12, batch 27500, loss[loss=0.2, simple_loss=0.2771, pruned_loss=0.06146, over 21834.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.2996, pruned_loss=0.07087, over 4278849.63 frames. ], batch size: 282, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:09:48,920 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2177826.0, ans=0.0 2023-06-29 00:09:56,287 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2177886.0, ans=0.125 2023-06-29 00:10:05,568 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2177886.0, ans=0.125 2023-06-29 00:10:15,302 INFO [train.py:996] (0/4) Epoch 12, batch 27550, loss[loss=0.1991, simple_loss=0.2737, pruned_loss=0.06219, over 21780.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2958, pruned_loss=0.06751, over 4278897.99 frames. ], batch size: 351, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:10:19,506 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.40 vs. limit=12.0 2023-06-29 00:10:31,670 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=2177946.0, ans=0.125 2023-06-29 00:10:57,005 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.117e+02 1.004e+03 1.516e+03 2.430e+03 4.785e+03, threshold=3.032e+03, percent-clipped=27.0 2023-06-29 00:11:44,133 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=2178186.0, ans=0.2 2023-06-29 00:11:54,696 INFO [train.py:996] (0/4) Epoch 12, batch 27600, loss[loss=0.1795, simple_loss=0.2541, pruned_loss=0.05242, over 21782.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2901, pruned_loss=0.06657, over 4270656.42 frames. 
], batch size: 317, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:12:12,819 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2178246.0, ans=0.0 2023-06-29 00:12:58,052 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=2178426.0, ans=0.2 2023-06-29 00:13:12,351 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-29 00:13:29,980 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=2178486.0, ans=0.125 2023-06-29 00:13:31,585 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2178486.0, ans=0.0 2023-06-29 00:13:34,124 INFO [train.py:996] (0/4) Epoch 12, batch 27650, loss[loss=0.1928, simple_loss=0.2664, pruned_loss=0.05958, over 21898.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.2841, pruned_loss=0.06591, over 4271418.93 frames. ], batch size: 98, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:14:09,608 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.whiten.whitening_limit, batch_count=2178606.0, ans=12.0 2023-06-29 00:14:17,724 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.954e+02 7.725e+02 1.101e+03 1.627e+03 3.974e+03, threshold=2.201e+03, percent-clipped=3.0 2023-06-29 00:14:21,788 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2178666.0, ans=0.0 2023-06-29 00:14:25,239 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2178666.0, ans=0.0 2023-06-29 00:14:33,264 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2178666.0, ans=0.125 2023-06-29 00:14:58,399 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2178786.0, ans=0.125 2023-06-29 00:15:15,562 INFO [train.py:996] (0/4) Epoch 12, batch 27700, loss[loss=0.2038, simple_loss=0.2822, pruned_loss=0.0627, over 21343.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.2848, pruned_loss=0.06451, over 4270450.25 frames. ], batch size: 131, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:15:16,241 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2178846.0, ans=0.125 2023-06-29 00:15:45,534 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=2178906.0, ans=0.2 2023-06-29 00:16:17,881 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=2179026.0, ans=0.2 2023-06-29 00:16:31,476 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.70 vs. limit=22.5 2023-06-29 00:16:48,391 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2179086.0, ans=0.125 2023-06-29 00:16:56,207 INFO [train.py:996] (0/4) Epoch 12, batch 27750, loss[loss=0.238, simple_loss=0.3062, pruned_loss=0.08491, over 21698.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2879, pruned_loss=0.06411, over 4269513.78 frames. 
], batch size: 508, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:17:38,056 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.12 vs. limit=22.5 2023-06-29 00:17:39,849 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 6.037e+02 8.775e+02 1.414e+03 2.124e+03 3.615e+03, threshold=2.828e+03, percent-clipped=21.0 2023-06-29 00:17:42,238 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=2179266.0, ans=10.0 2023-06-29 00:17:42,239 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=2179266.0, ans=0.0 2023-06-29 00:18:07,518 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2179326.0, ans=0.0 2023-06-29 00:18:09,193 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=2179326.0, ans=0.125 2023-06-29 00:18:35,463 INFO [train.py:996] (0/4) Epoch 12, batch 27800, loss[loss=0.2159, simple_loss=0.2881, pruned_loss=0.07187, over 21473.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.2849, pruned_loss=0.06407, over 4273473.53 frames. ], batch size: 194, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:18:58,787 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.72 vs. limit=12.0 2023-06-29 00:20:05,168 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2179686.0, ans=0.1 2023-06-29 00:20:16,292 INFO [train.py:996] (0/4) Epoch 12, batch 27850, loss[loss=0.2006, simple_loss=0.2795, pruned_loss=0.0609, over 21781.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.2839, pruned_loss=0.06493, over 4284449.41 frames. ], batch size: 112, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:20:25,530 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=2179746.0, ans=0.04949747468305833 2023-06-29 00:20:47,706 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2179806.0, ans=0.2 2023-06-29 00:20:51,824 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.79 vs. limit=6.0 2023-06-29 00:20:53,076 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=2179806.0, ans=0.0 2023-06-29 00:21:00,415 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.968e+02 8.980e+02 1.586e+03 2.122e+03 3.865e+03, threshold=3.171e+03, percent-clipped=6.0 2023-06-29 00:21:07,252 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2179866.0, ans=0.0 2023-06-29 00:21:46,478 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.31 vs. limit=22.5 2023-06-29 00:22:03,472 INFO [train.py:996] (0/4) Epoch 12, batch 27900, loss[loss=0.2189, simple_loss=0.3391, pruned_loss=0.04931, over 19809.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2927, pruned_loss=0.06587, over 4283740.33 frames. 
], batch size: 703, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:22:07,347 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2180046.0, ans=0.0 2023-06-29 00:22:27,606 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2180106.0, ans=0.125 2023-06-29 00:22:38,837 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2180106.0, ans=0.125 2023-06-29 00:23:51,645 INFO [train.py:996] (0/4) Epoch 12, batch 27950, loss[loss=0.1944, simple_loss=0.2852, pruned_loss=0.05184, over 21774.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.2932, pruned_loss=0.06287, over 4283686.45 frames. ], batch size: 247, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:24:35,525 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.187e+02 9.154e+02 1.408e+03 1.897e+03 4.005e+03, threshold=2.816e+03, percent-clipped=4.0 2023-06-29 00:24:36,157 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2180466.0, ans=0.1 2023-06-29 00:24:40,884 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=2180466.0, ans=10.0 2023-06-29 00:24:43,427 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.27 vs. limit=8.0 2023-06-29 00:25:31,892 INFO [train.py:996] (0/4) Epoch 12, batch 28000, loss[loss=0.2029, simple_loss=0.292, pruned_loss=0.05693, over 21728.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2909, pruned_loss=0.06171, over 4283242.35 frames. ], batch size: 389, lr: 2.38e-03, grad_scale: 32.0 2023-06-29 00:25:52,816 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.94 vs. limit=15.0 2023-06-29 00:27:03,950 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=2180886.0, ans=0.0 2023-06-29 00:27:14,002 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=2180946.0, ans=0.035 2023-06-29 00:27:14,593 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.66 vs. limit=22.5 2023-06-29 00:27:15,056 INFO [train.py:996] (0/4) Epoch 12, batch 28050, loss[loss=0.2353, simple_loss=0.3221, pruned_loss=0.0743, over 21515.00 frames. ], tot_loss[loss=0.2063, simple_loss=0.288, pruned_loss=0.06235, over 4281320.64 frames. ], batch size: 471, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:28:00,406 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.693e+02 7.704e+02 1.092e+03 1.721e+03 4.655e+03, threshold=2.185e+03, percent-clipped=4.0 2023-06-29 00:28:11,459 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.41 vs. limit=15.0 2023-06-29 00:28:19,472 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.26 vs. 
limit=10.0 2023-06-29 00:28:28,163 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2181126.0, ans=0.125 2023-06-29 00:28:59,168 INFO [train.py:996] (0/4) Epoch 12, batch 28100, loss[loss=0.2121, simple_loss=0.2826, pruned_loss=0.07074, over 21551.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.2869, pruned_loss=0.06334, over 4276482.47 frames. ], batch size: 441, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:30:06,847 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=2181426.0, ans=0.07 2023-06-29 00:30:16,137 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2181426.0, ans=0.0 2023-06-29 00:30:16,173 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-29 00:30:39,424 INFO [train.py:996] (0/4) Epoch 12, batch 28150, loss[loss=0.2123, simple_loss=0.2817, pruned_loss=0.07148, over 21827.00 frames. ], tot_loss[loss=0.2033, simple_loss=0.2809, pruned_loss=0.06283, over 4275911.80 frames. ], batch size: 98, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:30:50,486 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2181546.0, ans=0.125 2023-06-29 00:30:58,401 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=2181546.0, ans=0.2 2023-06-29 00:31:20,816 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.946e+02 8.258e+02 1.413e+03 2.441e+03 4.810e+03, threshold=2.825e+03, percent-clipped=31.0 2023-06-29 00:31:56,397 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2181726.0, ans=0.0 2023-06-29 00:32:20,021 INFO [train.py:996] (0/4) Epoch 12, batch 28200, loss[loss=0.2483, simple_loss=0.3288, pruned_loss=0.08392, over 21827.00 frames. ], tot_loss[loss=0.2032, simple_loss=0.278, pruned_loss=0.06413, over 4278053.88 frames. ], batch size: 124, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:32:51,424 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2181906.0, ans=0.0 2023-06-29 00:34:06,089 INFO [train.py:996] (0/4) Epoch 12, batch 28250, loss[loss=0.2064, simple_loss=0.2709, pruned_loss=0.07097, over 21229.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2824, pruned_loss=0.06764, over 4282869.46 frames. ], batch size: 176, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:34:20,173 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2182146.0, ans=0.1 2023-06-29 00:34:27,425 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.53 vs. limit=15.0 2023-06-29 00:34:48,082 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.469e+02 1.185e+03 1.669e+03 2.503e+03 4.651e+03, threshold=3.338e+03, percent-clipped=13.0 2023-06-29 00:35:35,911 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2182386.0, ans=0.125 2023-06-29 00:35:48,351 INFO [train.py:996] (0/4) Epoch 12, batch 28300, loss[loss=0.1737, simple_loss=0.2269, pruned_loss=0.06022, over 20655.00 frames. 
], tot_loss[loss=0.2041, simple_loss=0.2786, pruned_loss=0.06484, over 4264726.53 frames. ], batch size: 607, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 00:36:08,960 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2182506.0, ans=0.125 2023-06-29 00:36:10,882 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2182506.0, ans=0.125 2023-06-29 00:36:21,192 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.71 vs. limit=10.0 2023-06-29 00:37:29,516 INFO [train.py:996] (0/4) Epoch 12, batch 28350, loss[loss=0.1757, simple_loss=0.2715, pruned_loss=0.03992, over 21777.00 frames. ], tot_loss[loss=0.1988, simple_loss=0.2767, pruned_loss=0.06048, over 4266228.60 frames. ], batch size: 282, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 00:38:07,759 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2182866.0, ans=0.1 2023-06-29 00:38:15,052 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.438e+02 7.077e+02 1.027e+03 1.914e+03 4.296e+03, threshold=2.054e+03, percent-clipped=2.0 2023-06-29 00:38:33,603 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2182926.0, ans=0.125 2023-06-29 00:38:54,427 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2182986.0, ans=0.1 2023-06-29 00:39:10,246 INFO [train.py:996] (0/4) Epoch 12, batch 28400, loss[loss=0.1708, simple_loss=0.2365, pruned_loss=0.05253, over 21466.00 frames. ], tot_loss[loss=0.1963, simple_loss=0.2733, pruned_loss=0.05972, over 4264796.32 frames. ], batch size: 212, lr: 2.37e-03, grad_scale: 32.0 2023-06-29 00:39:21,446 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2183046.0, ans=0.0 2023-06-29 00:39:26,529 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=2183106.0, ans=0.0 2023-06-29 00:39:44,470 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2183106.0, ans=0.0 2023-06-29 00:40:21,928 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.07 vs. limit=15.0 2023-06-29 00:40:22,879 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2183226.0, ans=0.0 2023-06-29 00:40:28,536 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.51 vs. limit=15.0 2023-06-29 00:40:52,143 INFO [train.py:996] (0/4) Epoch 12, batch 28450, loss[loss=0.2534, simple_loss=0.3142, pruned_loss=0.09632, over 21584.00 frames. ], tot_loss[loss=0.2022, simple_loss=0.2784, pruned_loss=0.06304, over 4265450.14 frames. ], batch size: 471, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 00:40:56,826 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.43 vs. 
limit=10.0 2023-06-29 00:41:43,663 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.501e+02 7.826e+02 1.097e+03 1.608e+03 4.884e+03, threshold=2.195e+03, percent-clipped=11.0 2023-06-29 00:41:51,717 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.51 vs. limit=15.0 2023-06-29 00:41:52,376 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=2183466.0, ans=0.025 2023-06-29 00:42:10,218 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2183526.0, ans=0.125 2023-06-29 00:42:11,949 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=2183526.0, ans=0.125 2023-06-29 00:42:17,901 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.26 vs. limit=22.5 2023-06-29 00:42:22,426 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2183586.0, ans=0.125 2023-06-29 00:42:38,365 INFO [train.py:996] (0/4) Epoch 12, batch 28500, loss[loss=0.2449, simple_loss=0.3177, pruned_loss=0.08603, over 21839.00 frames. ], tot_loss[loss=0.2057, simple_loss=0.2808, pruned_loss=0.0653, over 4267938.89 frames. ], batch size: 441, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 00:43:43,882 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=2183826.0, ans=0.0 2023-06-29 00:43:58,873 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2183886.0, ans=0.125 2023-06-29 00:44:21,535 INFO [train.py:996] (0/4) Epoch 12, batch 28550, loss[loss=0.2296, simple_loss=0.2989, pruned_loss=0.08019, over 21304.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2878, pruned_loss=0.06719, over 4271814.29 frames. ], batch size: 548, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 00:44:25,940 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=2183946.0, ans=15.0 2023-06-29 00:44:29,559 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.49 vs. limit=10.0 2023-06-29 00:44:40,039 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/checkpoint-364000.pt 2023-06-29 00:45:05,093 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=2184066.0, ans=0.125 2023-06-29 00:45:09,847 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2184066.0, ans=0.0 2023-06-29 00:45:12,486 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.914e+02 9.584e+02 1.430e+03 2.076e+03 4.050e+03, threshold=2.859e+03, percent-clipped=23.0 2023-06-29 00:46:00,126 INFO [train.py:996] (0/4) Epoch 12, batch 28600, loss[loss=0.2279, simple_loss=0.3081, pruned_loss=0.07383, over 21571.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2947, pruned_loss=0.06945, over 4273732.54 frames. 
], batch size: 389, lr: 2.37e-03, grad_scale: 8.0 2023-06-29 00:46:17,440 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2184246.0, ans=0.2 2023-06-29 00:46:39,817 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=2184306.0, ans=0.2 2023-06-29 00:46:41,253 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2184366.0, ans=0.125 2023-06-29 00:47:45,762 INFO [train.py:996] (0/4) Epoch 12, batch 28650, loss[loss=0.2129, simple_loss=0.2822, pruned_loss=0.07176, over 21179.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.289, pruned_loss=0.06856, over 4274576.58 frames. ], batch size: 143, lr: 2.37e-03, grad_scale: 8.0 2023-06-29 00:48:30,054 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.107e+02 8.338e+02 1.219e+03 1.644e+03 3.488e+03, threshold=2.437e+03, percent-clipped=4.0 2023-06-29 00:49:26,535 INFO [train.py:996] (0/4) Epoch 12, batch 28700, loss[loss=0.2074, simple_loss=0.2838, pruned_loss=0.06551, over 21607.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2885, pruned_loss=0.06943, over 4273250.84 frames. ], batch size: 230, lr: 2.37e-03, grad_scale: 8.0 2023-06-29 00:49:49,021 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=2184906.0, ans=0.125 2023-06-29 00:50:00,526 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2184906.0, ans=0.2 2023-06-29 00:50:33,440 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2185026.0, ans=0.0 2023-06-29 00:50:58,661 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2185086.0, ans=0.1 2023-06-29 00:51:06,117 INFO [train.py:996] (0/4) Epoch 12, batch 28750, loss[loss=0.1844, simple_loss=0.2755, pruned_loss=0.04659, over 21629.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2894, pruned_loss=0.06976, over 4278994.93 frames. ], batch size: 263, lr: 2.37e-03, grad_scale: 8.0 2023-06-29 00:51:46,180 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2185266.0, ans=0.125 2023-06-29 00:51:50,493 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.316e+02 7.873e+02 1.096e+03 1.643e+03 3.604e+03, threshold=2.192e+03, percent-clipped=9.0 2023-06-29 00:51:57,706 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=2185266.0, ans=10.0 2023-06-29 00:52:10,870 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2185326.0, ans=0.125 2023-06-29 00:52:18,604 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=2185326.0, ans=0.035 2023-06-29 00:52:47,780 INFO [train.py:996] (0/4) Epoch 12, batch 28800, loss[loss=0.2384, simple_loss=0.317, pruned_loss=0.07995, over 21925.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2937, pruned_loss=0.07081, over 4282751.08 frames. 
], batch size: 372, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 00:53:03,167 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2185506.0, ans=0.1 2023-06-29 00:53:12,942 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=2185506.0, ans=0.2 2023-06-29 00:53:53,728 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.66 vs. limit=10.0 2023-06-29 00:54:09,034 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2185686.0, ans=0.0 2023-06-29 00:54:28,452 INFO [train.py:996] (0/4) Epoch 12, batch 28850, loss[loss=0.2145, simple_loss=0.2834, pruned_loss=0.0728, over 20804.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2953, pruned_loss=0.07197, over 4287421.90 frames. ], batch size: 607, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 00:54:35,753 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=2185746.0, ans=0.2 2023-06-29 00:54:40,600 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=2185746.0, ans=0.125 2023-06-29 00:55:10,536 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=2185866.0, ans=0.125 2023-06-29 00:55:12,470 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2185866.0, ans=0.125 2023-06-29 00:55:18,122 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.464e+02 8.066e+02 1.240e+03 1.993e+03 4.428e+03, threshold=2.479e+03, percent-clipped=20.0 2023-06-29 00:55:57,280 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.26 vs. limit=6.0 2023-06-29 00:56:11,323 INFO [train.py:996] (0/4) Epoch 12, batch 28900, loss[loss=0.2348, simple_loss=0.3132, pruned_loss=0.07823, over 20587.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2985, pruned_loss=0.07324, over 4287183.88 frames. ], batch size: 607, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 00:57:49,382 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2186286.0, ans=0.125 2023-06-29 00:57:53,909 INFO [train.py:996] (0/4) Epoch 12, batch 28950, loss[loss=0.2201, simple_loss=0.3226, pruned_loss=0.05877, over 21691.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.2996, pruned_loss=0.07307, over 4279915.03 frames. 
], batch size: 389, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 00:58:27,197 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=2186406.0, ans=0.025 2023-06-29 00:58:47,799 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.559e+02 9.001e+02 1.307e+03 1.896e+03 3.907e+03, threshold=2.614e+03, percent-clipped=14.0 2023-06-29 00:59:06,617 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2186526.0, ans=0.1 2023-06-29 00:59:10,039 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=2186526.0, ans=0.0 2023-06-29 00:59:40,847 INFO [train.py:996] (0/4) Epoch 12, batch 29000, loss[loss=0.2233, simple_loss=0.3028, pruned_loss=0.07191, over 21975.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.3028, pruned_loss=0.07213, over 4271424.40 frames. ], batch size: 317, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:00:22,587 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.84 vs. limit=15.0 2023-06-29 01:01:06,281 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2186886.0, ans=0.2 2023-06-29 01:01:06,838 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.17 vs. limit=15.0 2023-06-29 01:01:21,626 INFO [train.py:996] (0/4) Epoch 12, batch 29050, loss[loss=0.2176, simple_loss=0.2862, pruned_loss=0.07448, over 21685.00 frames. ], tot_loss[loss=0.223, simple_loss=0.3007, pruned_loss=0.07265, over 4276270.88 frames. ], batch size: 230, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:02:14,138 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.866e+02 7.703e+02 1.025e+03 1.554e+03 4.084e+03, threshold=2.051e+03, percent-clipped=7.0 2023-06-29 01:02:18,223 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=2187066.0, ans=0.2 2023-06-29 01:02:34,870 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=2187126.0, ans=0.0 2023-06-29 01:02:34,875 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2187126.0, ans=0.125 2023-06-29 01:02:38,597 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.85 vs. limit=15.0 2023-06-29 01:02:48,000 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2187186.0, ans=0.125 2023-06-29 01:03:02,215 INFO [train.py:996] (0/4) Epoch 12, batch 29100, loss[loss=0.177, simple_loss=0.2448, pruned_loss=0.05459, over 21759.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2931, pruned_loss=0.07014, over 4275788.37 frames. 
], batch size: 124, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:03:10,746 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=2187246.0, ans=0.0 2023-06-29 01:04:11,151 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2187426.0, ans=0.0 2023-06-29 01:04:17,755 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2187486.0, ans=0.1 2023-06-29 01:04:38,538 INFO [train.py:996] (0/4) Epoch 12, batch 29150, loss[loss=0.1966, simple_loss=0.2779, pruned_loss=0.05762, over 21222.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2912, pruned_loss=0.06808, over 4270867.92 frames. ], batch size: 159, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:05:03,177 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.72 vs. limit=15.0 2023-06-29 01:05:25,124 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=2187666.0, ans=0.0 2023-06-29 01:05:30,930 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.044e+02 8.799e+02 1.298e+03 1.831e+03 4.569e+03, threshold=2.596e+03, percent-clipped=20.0 2023-06-29 01:05:55,907 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2187786.0, ans=0.0 2023-06-29 01:05:59,217 INFO [scaling.py:1052] (0/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-29 01:06:18,429 INFO [train.py:996] (0/4) Epoch 12, batch 29200, loss[loss=0.2241, simple_loss=0.2734, pruned_loss=0.08735, over 21400.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2865, pruned_loss=0.06763, over 4273354.02 frames. ], batch size: 508, lr: 2.37e-03, grad_scale: 32.0 2023-06-29 01:06:18,973 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=2187846.0, ans=0.2 2023-06-29 01:07:22,247 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2188026.0, ans=0.125 2023-06-29 01:07:40,998 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.03 vs. limit=15.0 2023-06-29 01:07:55,820 INFO [scaling.py:962] (0/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.40 vs. limit=8.0 2023-06-29 01:08:03,705 INFO [train.py:996] (0/4) Epoch 12, batch 29250, loss[loss=0.1997, simple_loss=0.2869, pruned_loss=0.0562, over 21548.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.2854, pruned_loss=0.06592, over 4264682.71 frames. ], batch size: 195, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:08:40,561 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.02 vs. 
limit=15.0 2023-06-29 01:08:48,021 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2188266.0, ans=0.125 2023-06-29 01:08:52,918 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2188266.0, ans=0.125 2023-06-29 01:08:53,907 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.020e+02 6.977e+02 9.878e+02 1.357e+03 4.006e+03, threshold=1.976e+03, percent-clipped=3.0 2023-06-29 01:08:59,153 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2188266.0, ans=0.125 2023-06-29 01:09:15,621 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=2188326.0, ans=10.0 2023-06-29 01:09:25,467 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=2188386.0, ans=0.2 2023-06-29 01:09:43,868 INFO [train.py:996] (0/4) Epoch 12, batch 29300, loss[loss=0.1613, simple_loss=0.2401, pruned_loss=0.04129, over 15576.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2875, pruned_loss=0.06508, over 4263740.20 frames. ], batch size: 60, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:09:55,101 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2188446.0, ans=0.125 2023-06-29 01:10:33,029 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2188566.0, ans=0.125 2023-06-29 01:10:46,581 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2188626.0, ans=0.125 2023-06-29 01:11:21,218 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=2188686.0, ans=0.04949747468305833 2023-06-29 01:11:30,171 INFO [train.py:996] (0/4) Epoch 12, batch 29350, loss[loss=0.2063, simple_loss=0.2996, pruned_loss=0.05652, over 21222.00 frames. ], tot_loss[loss=0.2066, simple_loss=0.2838, pruned_loss=0.06464, over 4260552.02 frames. ], batch size: 549, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:11:44,290 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=2188746.0, ans=0.07 2023-06-29 01:12:16,937 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.695e+02 7.385e+02 1.115e+03 1.625e+03 3.431e+03, threshold=2.230e+03, percent-clipped=15.0 2023-06-29 01:12:21,660 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.33 vs. limit=15.0 2023-06-29 01:12:31,455 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.50 vs. limit=15.0 2023-06-29 01:13:11,566 INFO [train.py:996] (0/4) Epoch 12, batch 29400, loss[loss=0.2018, simple_loss=0.2874, pruned_loss=0.05812, over 21687.00 frames. ], tot_loss[loss=0.2051, simple_loss=0.2842, pruned_loss=0.063, over 4262512.68 frames. ], batch size: 391, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:13:42,069 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.74 vs. 
limit=15.0 2023-06-29 01:13:43,005 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2189106.0, ans=0.125 2023-06-29 01:14:01,258 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2189166.0, ans=0.0 2023-06-29 01:14:52,524 INFO [train.py:996] (0/4) Epoch 12, batch 29450, loss[loss=0.2683, simple_loss=0.3398, pruned_loss=0.09839, over 21438.00 frames. ], tot_loss[loss=0.204, simple_loss=0.283, pruned_loss=0.06254, over 4273497.94 frames. ], batch size: 471, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:15:17,737 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2189406.0, ans=0.0 2023-06-29 01:15:44,175 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.913e+02 9.244e+02 1.482e+03 2.285e+03 4.603e+03, threshold=2.964e+03, percent-clipped=27.0 2023-06-29 01:16:38,752 INFO [train.py:996] (0/4) Epoch 12, batch 29500, loss[loss=0.2104, simple_loss=0.282, pruned_loss=0.06941, over 21856.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2873, pruned_loss=0.06568, over 4279323.80 frames. ], batch size: 107, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:17:35,272 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2189826.0, ans=0.125 2023-06-29 01:17:57,476 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2189886.0, ans=0.125 2023-06-29 01:18:18,334 INFO [train.py:996] (0/4) Epoch 12, batch 29550, loss[loss=0.2142, simple_loss=0.2911, pruned_loss=0.06864, over 21918.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.2866, pruned_loss=0.06698, over 4282526.54 frames. ], batch size: 113, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:18:59,317 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2190066.0, ans=0.125 2023-06-29 01:19:05,171 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.716e+02 8.394e+02 1.189e+03 1.876e+03 3.636e+03, threshold=2.379e+03, percent-clipped=6.0 2023-06-29 01:19:17,296 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=2190126.0, ans=0.0 2023-06-29 01:20:00,973 INFO [train.py:996] (0/4) Epoch 12, batch 29600, loss[loss=0.2438, simple_loss=0.3327, pruned_loss=0.07739, over 21720.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2927, pruned_loss=0.06914, over 4282643.06 frames. ], batch size: 298, lr: 2.37e-03, grad_scale: 32.0 2023-06-29 01:20:08,319 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2190246.0, ans=0.1 2023-06-29 01:20:11,431 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2190246.0, ans=0.125 2023-06-29 01:20:12,115 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.96 vs. 
limit=15.0 2023-06-29 01:20:55,209 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=2190366.0, ans=0.2 2023-06-29 01:21:04,544 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2190426.0, ans=0.0 2023-06-29 01:21:11,106 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2190426.0, ans=0.125 2023-06-29 01:21:13,133 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.94 vs. limit=22.5 2023-06-29 01:21:30,411 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=2190486.0, ans=0.0 2023-06-29 01:21:41,168 INFO [train.py:996] (0/4) Epoch 12, batch 29650, loss[loss=0.1992, simple_loss=0.2744, pruned_loss=0.06196, over 21721.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.291, pruned_loss=0.06595, over 4279185.93 frames. ], batch size: 389, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:22:27,491 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2190666.0, ans=0.1 2023-06-29 01:22:33,828 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 4.536e+02 9.770e+02 1.872e+03 2.859e+03 6.209e+03, threshold=3.743e+03, percent-clipped=35.0 2023-06-29 01:23:22,771 INFO [train.py:996] (0/4) Epoch 12, batch 29700, loss[loss=0.2442, simple_loss=0.3576, pruned_loss=0.06544, over 21725.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2926, pruned_loss=0.06642, over 4281071.42 frames. ], batch size: 298, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:23:29,529 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2190846.0, ans=0.0 2023-06-29 01:23:31,302 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2190846.0, ans=0.125 2023-06-29 01:23:31,849 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.98 vs. limit=10.0 2023-06-29 01:24:36,675 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.81 vs. limit=15.0 2023-06-29 01:24:50,039 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=2191086.0, ans=10.0 2023-06-29 01:24:53,156 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=2191086.0, ans=0.0 2023-06-29 01:25:02,358 INFO [train.py:996] (0/4) Epoch 12, batch 29750, loss[loss=0.2152, simple_loss=0.2904, pruned_loss=0.07002, over 21906.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2988, pruned_loss=0.06653, over 4282122.18 frames. 
], batch size: 118, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:25:58,258 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.524e+02 7.715e+02 1.077e+03 1.535e+03 3.860e+03, threshold=2.154e+03, percent-clipped=1.0 2023-06-29 01:26:08,794 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2191326.0, ans=0.0 2023-06-29 01:26:42,172 INFO [train.py:996] (0/4) Epoch 12, batch 29800, loss[loss=0.218, simple_loss=0.2896, pruned_loss=0.07323, over 21404.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2992, pruned_loss=0.06696, over 4291161.45 frames. ], batch size: 194, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:28:20,979 INFO [train.py:996] (0/4) Epoch 12, batch 29850, loss[loss=0.2061, simple_loss=0.2805, pruned_loss=0.06581, over 21845.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2952, pruned_loss=0.06523, over 4283313.52 frames. ], batch size: 414, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:28:44,030 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=2191806.0, ans=0.2 2023-06-29 01:29:16,876 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.021e+02 7.804e+02 1.039e+03 1.669e+03 3.761e+03, threshold=2.078e+03, percent-clipped=15.0 2023-06-29 01:30:00,609 INFO [train.py:996] (0/4) Epoch 12, batch 29900, loss[loss=0.2287, simple_loss=0.3063, pruned_loss=0.07562, over 21379.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2923, pruned_loss=0.06574, over 4291456.59 frames. ], batch size: 131, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:30:51,525 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=2192166.0, ans=0.05 2023-06-29 01:31:46,203 INFO [train.py:996] (0/4) Epoch 12, batch 29950, loss[loss=0.2622, simple_loss=0.3477, pruned_loss=0.08835, over 21830.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2949, pruned_loss=0.06879, over 4283581.66 frames. ], batch size: 124, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:32:09,453 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2192406.0, ans=0.1 2023-06-29 01:32:32,599 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2192466.0, ans=0.1 2023-06-29 01:32:38,613 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.758e+02 9.924e+02 1.385e+03 1.832e+03 3.568e+03, threshold=2.770e+03, percent-clipped=22.0 2023-06-29 01:33:09,834 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.69 vs. limit=12.0 2023-06-29 01:33:32,980 INFO [train.py:996] (0/4) Epoch 12, batch 30000, loss[loss=0.1854, simple_loss=0.2853, pruned_loss=0.04276, over 21756.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2955, pruned_loss=0.06842, over 4278317.50 frames. ], batch size: 332, lr: 2.37e-03, grad_scale: 32.0 2023-06-29 01:33:32,981 INFO [train.py:1019] (0/4) Computing validation loss 2023-06-29 01:33:51,776 INFO [train.py:1028] (0/4) Epoch 12, validation: loss=0.255, simple_loss=0.3458, pruned_loss=0.08216, over 1796401.00 frames. 
2023-06-29 01:33:51,777 INFO [train.py:1029] (0/4) Maximum memory allocated so far is 23714MB 2023-06-29 01:34:33,072 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2192766.0, ans=0.125 2023-06-29 01:34:50,520 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.18 vs. limit=15.0 2023-06-29 01:35:35,151 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=2192886.0, ans=0.125 2023-06-29 01:35:35,163 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=2192886.0, ans=0.025 2023-06-29 01:35:37,106 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=2192946.0, ans=0.0 2023-06-29 01:35:38,166 INFO [train.py:996] (0/4) Epoch 12, batch 30050, loss[loss=0.2075, simple_loss=0.3007, pruned_loss=0.05714, over 21410.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2976, pruned_loss=0.06556, over 4271898.44 frames. ], batch size: 194, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:36:28,618 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2193066.0, ans=0.0 2023-06-29 01:36:30,336 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2193066.0, ans=0.0 2023-06-29 01:36:36,202 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.629e+02 9.060e+02 1.265e+03 2.367e+03 5.681e+03, threshold=2.530e+03, percent-clipped=16.0 2023-06-29 01:36:46,067 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2193126.0, ans=0.1 2023-06-29 01:36:53,029 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.41 vs. limit=15.0 2023-06-29 01:37:17,748 INFO [train.py:996] (0/4) Epoch 12, batch 30100, loss[loss=0.2485, simple_loss=0.2966, pruned_loss=0.1002, over 21325.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2981, pruned_loss=0.06609, over 4265615.27 frames. ], batch size: 507, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:37:36,218 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.83 vs. limit=6.0 2023-06-29 01:38:11,987 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.33 vs. limit=12.0 2023-06-29 01:38:23,279 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=2193426.0, ans=0.125 2023-06-29 01:39:04,268 INFO [train.py:996] (0/4) Epoch 12, batch 30150, loss[loss=0.2625, simple_loss=0.3211, pruned_loss=0.1019, over 21412.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2947, pruned_loss=0.06778, over 4265586.27 frames. 
], batch size: 471, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:39:16,795 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=2193546.0, ans=0.07 2023-06-29 01:39:29,147 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2193606.0, ans=0.125 2023-06-29 01:39:44,111 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=2193606.0, ans=0.5 2023-06-29 01:40:05,026 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.363e+02 8.482e+02 1.272e+03 2.081e+03 3.656e+03, threshold=2.544e+03, percent-clipped=13.0 2023-06-29 01:40:38,532 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2193786.0, ans=0.125 2023-06-29 01:40:47,524 INFO [train.py:996] (0/4) Epoch 12, batch 30200, loss[loss=0.2214, simple_loss=0.3223, pruned_loss=0.06023, over 21611.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2964, pruned_loss=0.06673, over 4269156.60 frames. ], batch size: 389, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:40:56,708 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.61 vs. limit=15.0 2023-06-29 01:41:53,163 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.92 vs. limit=15.0 2023-06-29 01:42:38,874 INFO [train.py:996] (0/4) Epoch 12, batch 30250, loss[loss=0.245, simple_loss=0.3557, pruned_loss=0.06716, over 21896.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.3033, pruned_loss=0.06846, over 4272822.41 frames. ], batch size: 317, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:42:50,981 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2194146.0, ans=0.125 2023-06-29 01:43:22,316 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2194266.0, ans=0.04949747468305833 2023-06-29 01:43:26,171 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.73 vs. limit=22.5 2023-06-29 01:43:33,015 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.232e+02 7.981e+02 1.163e+03 1.576e+03 2.909e+03, threshold=2.325e+03, percent-clipped=5.0 2023-06-29 01:43:48,866 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.77 vs. limit=15.0 2023-06-29 01:44:21,066 INFO [train.py:996] (0/4) Epoch 12, batch 30300, loss[loss=0.19, simple_loss=0.2574, pruned_loss=0.06129, over 21791.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.301, pruned_loss=0.06865, over 4265941.19 frames. 
], batch size: 317, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:44:21,655 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2194446.0, ans=0.125 2023-06-29 01:45:30,467 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2194626.0, ans=0.1 2023-06-29 01:45:35,391 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2194626.0, ans=0.125 2023-06-29 01:45:48,393 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=2194686.0, ans=0.025 2023-06-29 01:45:48,511 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2194686.0, ans=0.125 2023-06-29 01:46:09,182 INFO [train.py:996] (0/4) Epoch 12, batch 30350, loss[loss=0.2646, simple_loss=0.3636, pruned_loss=0.0828, over 21847.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.3021, pruned_loss=0.06962, over 4272280.77 frames. ], batch size: 317, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:46:12,284 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.45 vs. limit=15.0 2023-06-29 01:46:36,558 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2194806.0, ans=0.125 2023-06-29 01:46:39,864 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2194866.0, ans=0.09899494936611666 2023-06-29 01:46:49,135 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.687e+02 9.388e+02 1.588e+03 2.178e+03 4.101e+03, threshold=3.176e+03, percent-clipped=21.0 2023-06-29 01:46:51,572 INFO [scaling.py:962] (0/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.43 vs. limit=15.0 2023-06-29 01:47:07,008 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=2194926.0, ans=0.2 2023-06-29 01:47:26,703 INFO [train.py:996] (0/4) Epoch 12, batch 30400, loss[loss=0.2039, simple_loss=0.2523, pruned_loss=0.07773, over 20335.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2968, pruned_loss=0.0689, over 4259852.14 frames. ], batch size: 703, lr: 2.37e-03, grad_scale: 32.0 2023-06-29 01:47:33,915 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2195046.0, ans=0.1 2023-06-29 01:48:24,644 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2195226.0, ans=0.0 2023-06-29 01:48:31,462 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=2195226.0, ans=0.125 2023-06-29 01:48:50,513 INFO [train.py:996] (0/4) Epoch 12, batch 30450, loss[loss=0.2521, simple_loss=0.3726, pruned_loss=0.06576, over 19804.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2971, pruned_loss=0.06819, over 4200746.25 frames. 
], batch size: 702, lr: 2.37e-03, grad_scale: 8.0 2023-06-29 01:49:07,481 INFO [scaling.py:182] (0/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=2195406.0, ans=0.0 2023-06-29 01:49:38,086 INFO [optim.py:471] (0/4) Clipping_scale=2.0, grad-norm quartiles 5.733e+02 1.475e+03 2.498e+03 5.657e+03 1.532e+04, threshold=4.997e+03, percent-clipped=41.0 2023-06-29 01:49:57,476 INFO [checkpoint.py:75] (0/4) Saving checkpoint to zipformer/exp_L_small/epoch-12.pt 2023-06-29 01:50:00,038 INFO [train.py:1249] (0/4) Done!